Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Lucas, and today I'm going to talk to you about how to
design staging environments. Not only staging environments, pre production
environments in general. And before we get started,
I just thought I'd give you a brief intro about myself.
So I'm the founder of Ergomake, which you can find at ergomake.dev,
and we work a lot with these preproduction and these
staging environments. So all this content comes from what
we've learned dealing with users that want to set up
those kinds of environments. I'm also an author: I've written
Testing JavaScript Applications, a book published by Manning.
In the past I've been a core maintainer of Chai.js and Sinon.js,
and I've contributed to many other open source projects, including Jest,
which many of you may use if you use JavaScript.
Previously, before founding Ergomake, I was a senior software engineer
at Elastic, the company behind Elasticsearch.
So without further ado, I want to get started and talk about
staging environments. And to start, I'd like to explore why people
use staging, why people use preproduction environments today,
and why these environments matter. So today most
people use staging for one of these three reasons.
First, development, because individuals can't easily run
their whole app on their machines, so they will point to staging when
they cannot do that. If they need to run a thousand containers, or if they
have to have, I don't know, many queues running and things like that,
they usually cannot run all that on their machines, so they use staging
environments and just run one application on their machine.
The second reason is for testing, because e2e tests and synthetics
tests need to run somewhere, and a developer's machine is
usually not the ideal place to do so. Furthermore,
a developer's machine may not be representative enough of production.
And the third and final reason is validation, because non-technical
folks, designers and product people, usually cannot run
the whole app on their machines because they lack the technical expertise or
they lack the tools, and they need to do acceptance tests somewhere
so that they can provide feedback to developers. During
these three activities, staging and preproduction environments in general serve
three main purposes. The first is
to ensure that features match specs. The second
is to perform representative tests and catch bugs before they get to
production. And the third is for sharing context among
technical and non-technical team members. So when
you have one of these environments, product can see it, QA can see it,
designers can see it, not only devs can see it, as usually happens.
What usually happens is that devs do a lot of work on their machines.
They review the code, they send it to staging or to production,
and only then these people see it. So if you have these pre production
environments, people can see it early, before things get to production.
But there are better ways to get all of these benefits
than what we currently do. So what
are we doing wrong today? Well, there's plenty of challenges
that currently happen. The first, on the left
here, is that we have excessive technical complexity.
This complexity usually comes in the form of opaque CI
pipelines, complex Helm charts,
Terraform files, and many other infrastructure tools that developers don't usually
understand. And many times even infra people don't fully
understand them. This complex infrastructure then
leads to another problem. Developers cannot spin up preproduction
environments for themselves, so they end up depending on these
operations and infrastructure teams. And this dependency
leads to a lot of wasted time, long cycle times,
and very often bugs because developers don't understand the
underlying tooling and infrastructure. And I know
we've been talking a lot about DevOps, but without
empowering developers, without teaching them how infrastructure works,
we cannot get there. It is pretty difficult to understand
all those things. Another problem is that
most people today, they have a single staging environment, and that
happens for a myriad of reasons. Maybe staging is too
expensive if it's representative of production,
and maybe the complex infrastructure means it's not easy
to reproduce staging on demand whenever developers need it.
If that's the case, the developers who wrote code pointing to
staging end up not being able to work if
staging is down or misbehaving. So if staging goes down, everyone that
has to develop against staging is not able to work.
For designers, a problem with staging is that there's a lack of visibility
into what's being done before changes are merged and deployed.
So when feedback comes, it's usually too late and too expensive to
change things. For QAs, in addition
to having to wait for the app to be deployed before they can test,
they often have to deal with messy data or dirty state, which makes
their tests unrepresentative. So imagine you have
the single staging environment and designers are using it to validate
whether things look like they should. Developers are developing against it.
They are sending a bunch of fake data there. If that's the case,
maybe QA will do something and they won't know whether they've done something
wrong, whether there's a bug, or whether it was just someone
messing with the state in staging while they were testing.
So it's very difficult for QAs to properly analyze the code
that's there without proper state, without proper data.
And finally, product managers may also have to wait for
developers to deploy the change that the PO needs to see,
and then they have to juggle those dependencies themselves. So let's say
developer A is developing a feature and developer B is developing a feature that
impacts feature A in some way. If that's the
case, both developers cannot deploy their changes at the same time. They have to
resolve a conflict, and until they resolve those conflicts and send the changes to
preproduction, product cannot see them, because there's only
one staging environment. So there's again this bottleneck not
only for deploying to production, but also for validating things from
a product perspective.
Okay, so now that we understand these problems,
how do we actually solve them?
The first step is to eliminate the bottleneck and empower developers
to spin up environments at will. So instead
of having a single environment that's always there, you just tell developers:
whenever you need one, you can get an environment.
For that, it is essential to have simple infrastructure-as-code setups and pipelines
that are easy for developers to understand.
And that's where SREs and DevOps engineers come in.
I don't really like the term; I think of DevOps as a culture, not
as a role. But anyway, let's ignore that for now.
So for that, these people, they need to create a good set of abstractions
for the developers to use. Those abstractions may sometimes
even simplify the staging environments you deploy, given staging
doesn't always need to be exactly like production.
Furthermore, for those to work, it is important to embed
an infrastructure expert into the team to provide support
and empower developers. Because as I've mentioned before,
all these infrastructure tools,
they're really complex, and you cannot simply rely on developers
having the time to understand all of that. So by embedding
this expert into the team, they can kind
of act as a service person,
as a consultant, for those devs to be able to manage that
infrastructure. One way that people do this currently is
by having platform engineering teams that provide good APIs
for developers. But if that's not your case, if you don't have a platform engineering
team, it's perfectly fine to embed a DevOps person into
the development team. With that, you will be
able to have these multiple environments on demand and
cut the inter-team dependencies.
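Just to make that concrete, here's a rough sketch of the kind of abstraction I mean. It assumes there's already a Helm chart for the app and that CI pushes an image tagged per branch; the chart path, naming scheme, and URL below are all made up, so treat it as an illustration rather than a recipe.

```python
# Sketch of a "one command, one environment" abstraction for developers.
# Assumes a Helm chart at charts/app, a kubeconfig with cluster access,
# and CI that pushes an image tagged with the branch name (all hypothetical).
import subprocess
import sys


def create_preview_env(branch: str) -> str:
    """Spin up an isolated environment for a branch and return its URL."""
    release = f"preview-{branch.replace('/', '-')}"
    subprocess.run(
        [
            "helm", "upgrade", "--install", release, "charts/app",
            "--namespace", release, "--create-namespace",
            "--set", f"image.tag={branch}",      # image built by CI for this branch
            "--set", "resources.profile=light",  # lighter-weight than production
        ],
        check=True,
    )
    return f"https://{release}.preview.example.com"


if __name__ == "__main__":
    print(create_preview_env(sys.argv[1]))
```

The point is that a developer runs one command and gets a URL back, and everything about Helm, namespaces, and the rest of the infrastructure stays behind that function.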
Now something you're all probably thinking about is that having
too many environments sounds expensive, and it usually is.
Well, because of that, it is
important to allocate a resource pool for teams
that not only prevents you from worrying about excessive costs, but also
empowers the team to use the resources however they find most
useful. There are many ways of doing that, but one easy way,
especially if you go multicloud or if you run many clusters everywhere,
is to use the budgeting features of KubeSphere, which I
quite like, and they're quite easy to set up, as shown here.
If you do that, all of a sudden you don't have to worry about things
being too expensive anymore, because you're going to cap the use of the team.
And the team won't have to escalate whenever they need to buy something
on AWS, whenever they need a new instance or things like that.
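If you're not on KubeSphere, you can get a similar per-team cap with a plain Kubernetes ResourceQuota. Here's a minimal sketch using the official Python client; the namespace name and the limits are placeholders you'd tune for your own team.

```python
# Sketch: cap a team's preview/staging namespace with a ResourceQuota so
# on-demand environments can't blow the budget. Names and limits are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-budget"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "16",       # total CPU the team's environments may request
            "requests.memory": "64Gi",  # total memory across all of the team's environments
            "pods": "150",              # rough cap on concurrently running pods
        }
    ),
)
core.create_namespaced_resource_quota(namespace="team-a-previews", body=quota)
```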
So you empower the team to move fast
and remember that your environments, they don't necessarily have to
be too similar to production. Depending on what you're doing,
you can have lighter-weight environments, like when developing or validating the
product. And sometimes you can have functional equivalents,
even for testing, unless you're doing benchmarking or something that depends on
representative infrastructure. So the way I think about this is that
for development we need functional equivalents.
So the queues, the services that
you depend on, they need to be running so that you can point to staging,
point to pre production, and then you can do your dev there and you can
be reasonably confident that things will work as they should. For validation,
product people, designers, and QAs need some kind of guarantee that
things will be similar in terms of performance and in terms of
overall functioning as they would be in production. And for those
tasks, these people don't usually put much load into the system,
so you can scale them down and you will still have performance
equivalence. Finally, for testing, infrastructure equivalence
is important so that you catch bugs that you'd only catch
when you have a certain pool of infrastructure running. The important thing here is to
understand that even though you want an equivalent infrastructure,
by equivalent I don't mean equivalent in scale: I'm not saying you're
going to have 1000 instances or the same performance headroom as production.
Many of these bugs can be caught with
much less infrastructure than people usually have.
Another thing I recommend doing is to generate a representative fake data
set for production database schemas and integrate that into your
staging environment deployment pipeline. That way
every new environment gets a fresh set of data, which is super
useful not only for QAs, devs, and other folks to do representative
tests, but also to validate that migrations will work
appropriately, which is something we've seen quite a lot. Many, many people that come to
talk to us at Ergomake have this problem. They have migrations,
and they are not sure that their migrations are not going to destroy some kind
of important data. So they want to have these environments, so that the migration
runs there, and they want to have some representative data there
to have a good guarantee that things are not going to break when those migrations
go to production. And if you want to do that,
it is quite easy to write a small script to read your schemas and integrate
it with libraries like Faker. That way you can
guess what data goes in each column and generate it on the fly by looking at the
schema. So use Faker to replace any PII, any data that's sensitive,
put the generated data there, and then you have a
good guarantee that the data is going to be similar enough to production
to catch bugs with migrations.
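As a minimal sketch of what such a script could look like, assuming Python's faker library and a made-up users table; in practice you'd read the columns from information_schema or your ORM's metadata instead of hard-coding them.

```python
# Sketch: generate representative fake rows from a schema using Faker,
# so staging gets realistic data with no real PII in it.
# Assumes a hypothetical users(name, email, created_at) table.
from faker import Faker

fake = Faker()

# Map column names to generators; a real script would build this map
# from your schema introspection rather than hard-coding it.
generators = {
    "name": fake.name,
    "email": fake.email,
    "created_at": lambda: fake.date_time_this_year().isoformat(),
}


def fake_rows(columns, count=1000):
    """Yield dictionaries of fake values, one per row."""
    for _ in range(count):
        yield {col: generators[col]() for col in columns}


for row in fake_rows(["name", "email", "created_at"], count=3):
    print(row)
```

You'd then load those rows as part of the environment's deployment pipeline, so every fresh environment starts with a clean, representative dataset.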
Now, to solve the late feedback problem, something we've talked about
quite a lot is empowering non-technical users
and shifting left on their contribution. And I
usually like to do that by creating a directory where these people can get links
to all these environments and access them at any time. Usually the
pull request is the best place to have that. So tools like Vercel,
if you're using Next.js, and Ergomake itself can spin
those up for you automatically. So now all of a sudden, instead of those people
having to go to staging, validating things there, and not even knowing what's
there, these people can review code functionally
by clicking a link on a pull request. And just remember, you may
want to put some pages of these previews behind some
type of auth to avoid, for example, features being revealed
to the Internet or crawlers indexing them.
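As one hedged example of what that could look like, assuming the preview app is a Flask service and the credentials come from environment variables (both of those are placeholders, not something Ergomake or Vercel require):

```python
# Sketch: protect every page of a preview environment with HTTP basic auth.
# Assumes a Flask app and PREVIEW_USER / PREVIEW_PASSWORD set in the environment.
import os

from flask import Flask, Response, request

app = Flask(__name__)


@app.before_request
def require_basic_auth():
    auth = request.authorization
    ok = (
        auth is not None
        and auth.username == os.environ.get("PREVIEW_USER")
        and auth.password == os.environ.get("PREVIEW_PASSWORD")
    )
    if not ok:
        # Asking for credentials also keeps crawlers from indexing the preview.
        return Response(
            "Authentication required", 401,
            {"WWW-Authenticate": 'Basic realm="preview"'},
        )


@app.route("/")
def home():
    return "Preview environment"
```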
Finally, another very important thing is to have some kind of
observability set up, so that when product people or designers
or QAs are doing some tests, you get alerts and you
see when something went wrong in the metrics.
So if you just deploy a Prometheus server or some kind of Alertmanager which
sends messages to Slack, that's ideal, because many times,
even though things will look okay to these folks,
there may be an error thrown in the backend that they won't see, and it may be something critical.
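To give a rough idea, here's a tiny sketch that asks Prometheus for recent 5xx errors in the environment and posts to a Slack incoming webhook. The URLs, the query, and the metric name are all placeholders for whatever your own setup exposes.

```python
# Sketch: poll Prometheus for backend errors in a preview environment and
# post to Slack when any show up. URLs, query, and metric are placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.preview.svc:9090"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
QUERY = 'sum(increase(http_requests_total{status=~"5.."}[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
results = resp.json()["data"]["result"]
errors = float(results[0]["value"][1]) if results else 0.0

if errors > 0:
    requests.post(SLACK_WEBHOOK, json={
        "text": f"{errors:.0f} backend 5xx errors in the preview environment "
                "in the last 5 minutes",
    })
```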
Synthetics and e2e tests are also a great thing to have running against these
environments. In production, you're going to catch errors and get alerts when things go wrong
because there are people using it. But preproduction usually doesn't
get as much load, so you can kind of reproduce these workflows with synthetics tests
running against these environments all the time. And you can also save production
from having to be the target of e2e tests,
because you can run e2e tests separately, isolated, against each of
these environments.
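As a small illustration, a synthetic check can be as simple as a couple of pytest-style tests pointed at whichever environment URL you pass in; the endpoints and the signup payload below are hypothetical.

```python
# Sketch: tiny synthetic tests that can run on a schedule against any
# preproduction environment. TARGET_ENV_URL and the endpoints are placeholders.
import os

import requests

BASE_URL = os.environ.get(
    "TARGET_ENV_URL", "https://preview-my-branch.preview.example.com"
)


def test_homepage_is_up():
    # Cheapest possible signal that the environment is alive at all.
    resp = requests.get(f"{BASE_URL}/", timeout=10)
    assert resp.status_code == 200


def test_signup_flow_responds():
    # Exercise a critical workflow end to end, not just a health endpoint.
    resp = requests.post(
        f"{BASE_URL}/api/signup",
        json={"email": "synthetic@example.com"},
        timeout=10,
    )
    assert resp.status_code in (200, 201)
```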
It's also important to say that you won't be able to catch all possible problems with these environments.
Their main utility is to shift left on as many stages
of the development lifecycle as possible. Therefore,
you should really focus on having a reliable production environment
and have feature flags, to avoid holding things up and not deploying
just because one change cannot ship yet. So you just switch a feature flag
on or off, and then you can continue deploying even if
something escapes this review process, these shift-left validations.
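Here's a minimal sketch of that idea; the flag here is just an environment variable, whereas a real setup would usually use a flag service, but the principle is the same: the code ships while the feature stays switched off.

```python
# Sketch: the simplest possible feature flag, read from the environment.
# The code can be deployed while the unfinished feature stays dark.
import os


def is_enabled(flag_name: str) -> bool:
    """A flag is on when FEATURE_<NAME>=true is set for the environment."""
    return os.environ.get(f"FEATURE_{flag_name.upper()}", "false").lower() == "true"


def render_dashboard() -> str:
    if is_enabled("new_dashboard"):
        return "new dashboard"   # the work-in-progress feature
    return "old dashboard"       # safe fallback that keeps deploys unblocked
```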
So to summarize: use on-demand environments,
not just one. Empower developers to spin up environments,
manage resources on a team basis,
use fresh copies of data for staging environments,
make preproduction environments observable, run periodic tests against
those environments, have different profiles for preproduction
like lightweight and heavyweight, and do not allow
preproduction to become a bottleneck. For Q&A, you can either send me an email
at Lucas Costa at ergomake.com, or you can go to ergomake.dev and you
will find some links there where you can talk to us.
On Twitter, I'm @thewizardlucas, and you can also send me a message there.
Thank you very much. It's been a pleasure.