Abstract
At StackPulse we use a full CI/CD pipeline with FluxCD + Flagger to support our CD culture and needs. We also developed Puerta, a homegrown gating service that implements the Flagger webhook phases to support time-based approvals, triggering E2E jobs (CircleCI etc.), load testing, and auto-release to production once the canary passes on staging.
In this session, we’ll discuss why and how we built Puerta and dig into three key areas:
- How to customize your CD workflow to fit your needs and culture
- How to empower developers so they can quickly deploy their code securely
- How we dedicated time and resources to developing this internal CD service
Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everybody, my name is Azad Rudyk, I'm the director of engineering at StackPulse, and I'm here with Orr, our SRE lead. Today we are going to discuss Puerta, a gating service that we created for our Kubernetes-native continuous delivery. Here is what we are going to cover: first, a quick introduction to StackPulse. After that we will talk a little bit more about our delivery pipeline, and later we will dig into our custom gates. We will explain how they were created and how they support our culture. And after that we will describe Puerta, the service that handles those gates and supports our pipeline. Okay, so, a bit about
StackPulse. StackPulse is a fairly new startup: we create a SaaS platform for SREs and for reliability in general. We call it reliability as code. We digest many events coming from monitoring systems, and we enable SREs to respond to those events automatically by executing automations that we call playbooks. Playbooks help investigate and remediate events and resolve incidents automatically, without any manual intervention, and therefore reach a faster resolution, a safer resolution and a quicker response. Now, a bit about our tech stack.
At StackPulse we leverage Google Cloud Platform as our cloud provider, and we rely heavily on Kubernetes to deploy our services; specifically, we use GKE, the managed Kubernetes solution in GCP. We strongly believe in immutable infrastructure, so we have Terraform code that describes all our infrastructure as code. And, as you probably guessed, we have a cloud-native architecture: microservices, with modern RPC for the communication between them. As the context of this talk, we have full CI/CD, from a merge to the main branch up to production, automatically, without any human intervention in between. That's the context of this talk, and we'll dig a little bit into that in the next slides.
Okay, so with that, I'll let Orr explain and discuss our pipeline.
Thank you, Rudyk. I'm Orr, the SRE lead at StackPulse, and I'll take it from here. Let's talk about the CD pipeline. At StackPulse we use fairly common infrastructure: GitHub hosts our code repositories and CircleCI runs the CI pipelines. We build with Lego bricks (FluxCD, Flagger and CircleCI) and connect all of it together with Puerta, which I'll discuss in a bit. For the CI part, GitHub kicks off a CircleCI job for each commit, and a successful job ends with a Docker/OCI image pushed to a registry. In the CD part we have Flux. For those who don't know, Flux is a GitOps toolkit which listens to two resources: Git for configuration, and a container registry (in our case Google Container Registry) for new artifacts and images. We push the image to the registry, Flux refreshes its cache, discovers the image and applies it to the workload; Flagger then recognizes the change and triggers a new canary pipeline. The way we extend Flagger is through webhooks: listed here are all the webhook stages that Flagger supports, and we leverage and implement those in Puerta. The custom gates are implemented in Puerta, like I said, and this is how we extend Flagger to support our internal tooling and the organizational culture and structure we use.
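To make the webhook mechanics concrete, here is a minimal sketch (not Puerta's actual code) of what a Flagger gating service boils down to: a small HTTP server that decodes the JSON payload Flagger POSTs (canary name, namespace, phase, metadata) and answers 2xx to let the canary advance, or a non-2xx status to hold or fail it. The route path and the demo gate logic are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"log"
	"net/http"
)

// Payload mirrors the JSON Flagger POSTs to its webhooks:
// the canary's name, namespace, lifecycle phase and free-form metadata.
type Payload struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Phase     string            `json:"phase"`
	Metadata  map[string]string `json:"metadata,omitempty"`
}

// gate decides whether a canary may advance. Returning an error makes the
// handler answer with a non-2xx status, which Flagger treats as "gate closed".
type gate func(p Payload) error

func gateHandler(g gate) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var p Payload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "bad payload", http.StatusBadRequest)
			return
		}
		if err := g(p); err != nil {
			http.Error(w, err.Error(), http.StatusForbidden)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	// One route per Flagger webhook we implement; this placeholder gate
	// only lets canaries in the "staging" namespace advance.
	http.Handle("/gate/confirm-rollout", gateHandler(func(p Payload) error {
		if p.Namespace != "staging" {
			return errors.New("only staging canaries pass this demo gate")
		}
		return nil
	}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```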
Let's dig a bit deeper into what happens to a commit. A PR gets merged to the main branch, then the CI system triggers a build on the main branch for staging. Flux then updates the workload for the canary, Flagger triggers a new canary pipeline, and the pre-rollout webhook triggers our E2E and waits for the E2E job to finish.
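The talk doesn't show the implementation, but a pre-rollout gate along these lines can kick off the E2E run through CircleCI's v2 pipeline API and fail the Flagger check if the run doesn't succeed. The project slug, pipeline parameters and environment variable below are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// triggerE2E starts a CircleCI pipeline for a (hypothetical) E2E project via
// the v2 API. A real pre-rollout gate would also poll the pipeline's workflows
// until the run finishes and only then answer Flagger's webhook.
func triggerE2E(canary, namespace string) (string, error) {
	body, _ := json.Marshal(map[string]any{
		"branch": "main",
		"parameters": map[string]any{ // hypothetical pipeline parameters
			"target_service":   canary,
			"target_namespace": namespace,
		},
	})
	req, err := http.NewRequest(http.MethodPost,
		"https://circleci.com/api/v2/project/gh/example-org/e2e-tests/pipeline",
		bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Circle-Token", os.Getenv("CIRCLECI_TOKEN"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("CircleCI returned %s", resp.Status)
	}
	var out struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.ID, nil
}

func main() {
	id, err := triggerE2E("podinfo", "staging")
	if err != nil {
		panic(err)
	}
	fmt.Println("triggered E2E pipeline", id)
}
```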
During the rollout we use Flagger's built-in metric checks: it queries our Prometheus monitoring system for success rate and latency. We fine-tune each canary's metrics depending on its SLOs, and if we stray from the success rate and latency targets we set for the service, we fail the rollout, roll back the canary deployment and shift all the traffic back to the primary. Then we have the post-rollout: on a successful rollout, Flagger calls the post-rollout webhook we implement, which creates a new release on GitHub, and then everything happens the same way again on prod. A new CI build triggers on the new release, Flux updates the workload on production, Flagger triggers a new canary pipeline on production, and so the cycle repeats.
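As a rough sketch of that post-rollout step, a handler can cut the GitHub release through the REST API; the repository, tag scheme and token handling here are placeholders, not StackPulse's actual setup.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// createRelease cuts a GitHub release after a successful staging rollout,
// which in turn kicks off the production build. Repo and tag are placeholders.
func createRelease(repo, tag string) error {
	body, _ := json.Marshal(map[string]any{
		"tag_name":               tag,
		"name":                   tag,
		"generate_release_notes": true,
	})
	req, err := http.NewRequest(http.MethodPost,
		fmt.Sprintf("https://api.github.com/repos/%s/releases", repo),
		bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("GitHub returned %s", resp.Status)
	}
	return nil
}

func main() {
	if err := createRelease("example-org/podinfo", "v1.2.3"); err != nil {
		panic(err)
	}
	fmt.Println("release created; the production pipeline will pick it up")
}
```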
So why do we need custom gates? Let's discuss. We want to deliver value to our customers: since we are a new startup, we want to impact customers as fast and as reliably as we can. The way we do this is with gating. We make developers feel comfortable pushing all day, every day, relying on a gating service to fence off their failures and keep bad releases from reaching production. We also want to support our organizational culture. We strongly believe in a great engineering culture throughout the organization, and CD is no exception: we want everything to support that culture. We do this by gating on E2E-tested flows only, where the E2E tests are written from the point of view of the user. This means we can catch bugs that developers may have missed in unit and integration tests, catch them in the gate, and prevent them from reaching production. Visibility is a crucial part of the pipeline.
Developers want to know the state and phase of their deployment and where it stands. Is it in prod yet or not? Can I reach the code on prod? Can I test it? Can I check that everything I tested in dev is actually working as expected or not? We strongly believe in developers having full ownership from dev to production. You build it, you run it: you are the owner of the commit, you make sure it reaches staging and behaves correctly, you test it, you write the E2E tests, you write the integration and unit tests, and you make sure everything reaches production in a safe manner.
Another benefit of having a gate is that we can gather all the events happening in Flagger, keep an audit trail in logs, and store them for a longer time, plus have them in a central channel. We can follow when and where a release happened, and we can use that for compliance reasons. So let's talk about the gates a bit. We have the confirm-rollout webhook implemented in Puerta. We strongly believe in reliability, since we are a reliability platform.
And what we want from developers is to actually work when they feel comfortable, remotely, in the office or at night, when they reach peak performance. We don't want to block them from merging their code when it's ready. We believe in small code changes and PRs, and in merging constantly; small PRs are a crucial part of that. The thing is, everything is automated, so we don't want someone merging their commit at midnight and having it reach prod. So we gate releases from reaching production outside certain hours, when people cannot attend to their code reaching production without waking up the on-call engineer. We make sure the rollout happens during work hours: we have 12 hours during the day when everyone can attend and actually respond to on-call pages about bad deployments. So it guards developers from unintentional mistakes at night, and it keeps the code chunks smaller, more reliable and easier to read, instead of having a big chunk of code reaching production at any time of day. So we use the confirm-rollout webhook to simulate release trains, and we queue releases so they only reach production during certain hours.
Here is an example. The canary is waiting: this is Friday, which is the weekend here in Israel. People don't want to be woken up by a release happening during the weekend, so this release will queue up until Sunday morning and let our devs have their weekend with their families, et cetera.
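A work-hours release train of this kind can be sketched in a few lines; the weekend days (Friday/Saturday in Israel) and the roughly 12-hour attended window follow the talk's description, but the exact hours, time zone handling and wiring into the webhook handler are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// confirmRollout implements a work-hours release train: a canary queued
// outside the window keeps getting a "closed" answer until Flagger retries
// inside it. Returning an error maps to a non-2xx webhook response.
func confirmRollout(now time.Time) error {
	loc, err := time.LoadLocation("Asia/Jerusalem")
	if err != nil {
		return err
	}
	now = now.In(loc)

	// Weekend in Israel is Friday and Saturday; hold all rollouts.
	if now.Weekday() == time.Friday || now.Weekday() == time.Saturday {
		return errors.New("weekend: rollout queued until Sunday morning")
	}
	// Roughly a 12-hour attended window during the working day.
	if now.Hour() < 8 || now.Hour() >= 20 {
		return errors.New("outside work hours: rollout queued")
	}
	return nil
}

func main() {
	if err := confirmRollout(time.Now()); err != nil {
		fmt.Println("gate closed:", err)
		return
	}
	fmt.Println("gate open: canary may proceed")
}
```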
The next gate is the E2E execution. We use the pre-rollout webhook to trigger a CI job on each newly deployed canary. If the job fails, we fail the entire canary without actually impacting prod; the pre-rollout gate stops traffic from getting to the new canary before even a single percent of traffic is propagated. We use Playwright, which is an E2E infrastructure for UI tests, and our APIs are entirely gRPC-based, so we enjoy the fact that we get generated clients for each API. We leverage these generated clients in the E2E to run both UI and API tests in the same job. This helps us simulate a user accessing our systems from the CLI, the API or the UI, and helps us catch bugs which we couldn't find or catch beforehand. This adds another layer of protection and makes developers feel much more comfortable when pushing code.
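For the API side, a minimal sketch of such an E2E check in Go might look like the test below. The real suite would drive each StackPulse service through its own generated gRPC client; the standard gRPC health client stands in here only so the example compiles without those protos, and the canary address is made up.

```go
package e2e

import (
	"context"
	"testing"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// TestCanaryIsServing hits the freshly deployed canary the same way a user
// or the CLI would: over gRPC, through a generated client.
func TestCanaryIsServing(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "podinfo-canary.staging.svc:9000",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		t.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		t.Fatalf("health check: %v", err)
	}
	if resp.GetStatus() != healthpb.HealthCheckResponse_SERVING {
		t.Fatalf("canary not serving: %v", resp.GetStatus())
	}
}
```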
The next gate is crucial. It seems trivial to have Slack notifications for each phase, but what we had at first were the built-in notifications from Flagger, which are lacking; they're not that verbose. So we added an event webhook in Puerta, which took every event verbosely and pushed it to a central channel. What we then realized is that developers were complaining about the noise in that channel: very verbose messages about all the microservices we deploy at StackPulse, for dev and staging, all concentrated in the same place. We decided to keep the channel and the copied messages, for auditing and transparency in the organization and so everyone knows what happened when, which lets us correlate between incidents and deployments. But we also wanted developers to know where they stand: in which phase their code and commit is, whether it reached production or staging, and whether it failed. And if it failed, they can go back, figure out why (maybe a 500, maybe something in the server, maybe something in the E2E broke because they broke the contract) and go fix it. The feedback loop is much shorter, the DM is targeted at the developer, and it helps the developer identify bugs earlier in the pipeline.
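A sketch of such an event webhook is below, assuming the event payload shape Flagger documents (name, namespace, phase, and a human-readable message in the metadata) and Slack's chat.postMessage API; the channel names and the commit-owner lookup are placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
)

// event roughly mirrors what Flagger sends to its event webhook.
type event struct {
	Name      string            `json:"name"`
	Namespace string            `json:"namespace"`
	Phase     string            `json:"phase"`
	Metadata  map[string]string `json:"metadata"`
}

// postToSlack sends one message via Slack's chat.postMessage API.
func postToSlack(channel, text string) error {
	body, _ := json.Marshal(map[string]string{"channel": channel, "text": text})
	req, _ := http.NewRequest(http.MethodPost,
		"https://slack.com/api/chat.postMessage", bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("SLACK_TOKEN"))
	req.Header.Set("Content-Type", "application/json; charset=utf-8")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func eventHandler(w http.ResponseWriter, r *http.Request) {
	var e event
	if err := json.NewDecoder(r.Body).Decode(&e); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	msg := fmt.Sprintf("[%s/%s] %s: %s",
		e.Namespace, e.Name, e.Phase, e.Metadata["eventMessage"])

	// Verbose copy to the central audit channel, curated copy as a DM to the
	// commit owner (owner lookup not shown; channel IDs are placeholders).
	_ = postToSlack("#deployments-audit", msg)
	_ = postToSlack("@commit-owner", msg)
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/gate/event", eventHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```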
So, I'll give it back to Rudyk. I think we're done. If anyone has questions regarding Flagger and Puerta, they can reach me.
So let's sum up. At StackPulse we create a SaaS platform for SREs and for reliability in general. We rely heavily on our continuous deployment pipeline to do that safely and to deliver value fast to our customers. To do that, we use Flagger, which is a very common solution in that field, but we had to extend Flagger's built-in functionality with a custom tool that we created, which we call Puerta. Puerta has custom gates that support our own needs and our own organizational culture. As Orr mentioned, we have E2E there, we have notifications there, and many other things, and that's what helps us achieve that fast value and a fast, safe feedback loop. So it helps us a lot to extend that functionality and get a state-of-the-art continuous deployment pipeline. Thank you so much for watching our session on Puerta, a gating service for Kubernetes-native CD. Please feel free to reach out on Twitter and ask us additional questions. We'd love to hear from you.