Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone, I'm Lucas, and today I'm going to talk to you about how to
design staging environments. Not only staging environments, pre production
environments in general. And before we get started,
I just thought I'd give you a brief intro about myself.
So I'm the founder of Ergomake, which you can find at ergomake.dev,
and we work a lot with these preproduction and these
staging environments. So all this content comes from what
we've learned dealing with users that want to set up
those kinds of environments. I'm also an author: I've written
Testing JavaScript Applications, a book published by Manning.
In the past I've been a core maintainer of Chai.js and Sinon.js,
and I've contributed to many other open source projects, including Jest,
which many of you may use if you use JavaScript.
Previously, before founding Ergomake, I was a senior software engineer
at Elastic, the company behind Elasticsearch.
So without further ado, I want to get started and talk about
staging environments. And to start, I'd like to explore why people
use staging, why people use preproduction environments today,
and why these environments matter. So today most
people use staging for one of these three reasons.
First, development, because individuals can't easily run
their whole app on their machines, so they will point to staging when
they cannot do that. If they need to run a thousand containers, or if they
have to have, I don't know, many queues running and things like that,
they usually cannot run all that on their machines, so they use staging
environments and just run one application on their machine.
The second reason is for testing, because e2e tests and synthetics
tests need to run somewhere, and a developer's machine is
usually not the ideal place to do so. Furthermore,
a developer's machine may not be representative enough of production.
And the third and final reason is validation, because non-technical
folks, designers and product people, usually cannot run
the whole app on their machines because they lack the technical expertise or
they lack the tools, and they need to do acceptance tests somewhere
so that they can provide feedback to developers. During
these three activities, staging and preproduction environments in general serve
three main purposes. The first is
to ensure that features match specs. The second
is to perform representative tests and catch bugs before they get to
production. And the third is for sharing context among
technical and non-technical team members. So when
you have one of these environments, product can see it, QA can see it,
designers can see it, not only devs can see it, as usually happens.
What usually happens is that devs do a lot of work on their machines.
They review the code, they send it to staging or to production,
and only then these people see it. So if you have these pre production
environments, people can see it early, before things get to production.
But there are better ways to get all of these benefits
than what we currently do. So what
are we doing wrong today? Well, there's plenty of challenges
that currently happen. The first, on the left
here, is that we have excessive technical complexity.
This complexity usually comes in the form of opaque CI
pipelines, complex Helm charts,
Terraform files, and many other infrastructure tools that developers don't usually
understand. And many times even infra people don't fully
understand them. This complex infrastructure then
leads to another problem. Developers cannot spin up preproduction
environments for themselves, so they end up depending on these
operations and infrastructure teams. And this dependency
leads to a lot of wasted time, long cycle times,
and very often bugs because developers don't understand the
underlying tooling and infrastructure. And I know
we've been talking a lot about DevOps, but without
empowering developers, without teaching them how infrastructure works,
we cannot get there. It is pretty difficult to understand
all those things. Another problem is that
most people today, they have a single staging environment, and that
happens for a myriad of reasons. Maybe staging is too
expensive if it's representative of production,
and maybe the complex infrastructure means it's not easy
to reproduce staging on demand whenever developers need it.
If that's the case, the developers who wrote code pointing to
staging end up not being able to work if
staging is down or misbehaving. So if staging goes down, everyone that
has to develop against staging is not able to work.
For designers, a problem with staging is that there's a lack of visibility
into what's being done before changes are merged and deployed.
So when feedback comes, it's usually too late and too expensive to
change things. For QAs, in addition
to having to wait for the app to be deployed before they can test,
they often have to deal with messy data or dirty state, which makes
their tests unrepresentative. So imagine you have
the single staging environment and designers are using it to validate
whether things look like they should. Developers are developing against it.
They are sending a bunch of fake data there. If that's the case,
maybe QA will do something and they won't know whether they've done something
wrong, whether there's a bug, or whether it was just someone
messing with the state in staging while they were testing.
So it's very difficult for QAs to properly analyze the code
that's there without proper state, without proper data.
And finally, product managers may also have to wait for
developers to deploy the change that the PO needs to see,
and then they have to juggle those dependencies themselves. So let's say
developer A is developing a feature and developer B is developing a feature that
impacts feature A in some way. If that's the
case, both developers cannot deploy their changes at the same time. They have to
resolve a conflict, and until they resolve those conflicts and send the changes to
preproduction, product cannot see them, because there's only
one staging environment. So there's again this bottleneck not
only for deploying to production, but also for validating things from
a product perspective.
Okay, so now that we understand these problems,
how do we actually solve them?
The first step is to eliminate the bottleneck and empower developers
to spin up environments at will. So instead
of having a single environment that's always there, you just tell developers:
whenever you need one, you can get an environment.
For that, it is essential to have simple infrastructure-as-code setups and pipelines
that are easy for developers to understand.
And that's where SREs and DevOps engineers come in.
I don't really like the term; I think of DevOps as a culture, not
as a role. But anyway, let's ignore that for now.
So for that, these people, they need to create a good set of abstractions
for the developers to use. Those abstractions may sometimes
even simplify the staging environments you deploy, given staging
doesn't always need to be exactly like production.
Furthermore, for those to work, it is important to embed
an infrastructure expert into the team to provide support
and empower developers. Because as I've mentioned before,
all these infrastructure tools,
they're really complex, and you cannot simply rely on developers
having the time to understand all of that. So by embedding
this expert into the team, they can kind
of act as a service person,
as a consultant, for those devs to be able to manage that
infrastructure. One way that people do this currently is
by having platform engineering teams that provide good APIs
for developers. But if that's not your case, if you don't have a platform engineering
team, it's perfectly fine to embed a DevOps person into
the development team. With that, you will be
able to have these multiple environments on demand and
cut the inter-team dependencies.
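Just to make that concrete, here's a rough sketch of the kind of abstraction I mean. It assumes there's already a Helm chart for the app and that CI pushes an image tagged per branch; the chart path, naming scheme, and URL below are all made up, so treat it as an illustration rather than a recipe.

```python
# Sketch of a "one command, one environment" abstraction for developers.
# Assumes a Helm chart at charts/app, a kubeconfig with cluster access,
# and CI that pushes an image tagged with the branch name (all hypothetical).
import subprocess
import sys


def create_preview_env(branch: str) -> str:
    """Spin up an isolated environment for a branch and return its URL."""
    release = f"preview-{branch.replace('/', '-')}"
    subprocess.run(
        [
            "helm", "upgrade", "--install", release, "charts/app",
            "--namespace", release, "--create-namespace",
            "--set", f"image.tag={branch}",      # image built by CI for this branch
            "--set", "resources.profile=light",  # lighter-weight than production
        ],
        check=True,
    )
    return f"https://{release}.preview.example.com"


if __name__ == "__main__":
    print(create_preview_env(sys.argv[1]))
```

The point is that a developer runs one command and gets a URL back, and everything about Helm, namespaces, and the rest of the infrastructure stays behind that function.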
Now something you're all probably thinking about is that having
too many environments sounds expensive, and it usually is.
Well, because of that, it is
important to allocate a resource pool for teams
that not only prevents you from worrying about excessive costs, but also
empowers the team to use the resources however they find most
useful. There are many ways of doing that, but one easy way,
especially if you go multicloud or if you run many clusters everywhere,
is to use the budgeting features of KubeSphere, which I
quite like, and they're quite easy to set up, as shown here.
If you do that, all of a sudden you don't have to worry about things
being too expensive anymore, because you're going to cap the use of the team.
And the team won't have to escalate whenever they need to buy something
on AWS, whenever they need a new instance or things like that.
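If you're not on KubeSphere, you can get a similar per-team cap with a plain Kubernetes ResourceQuota. Here's a minimal sketch using the official Python client; the namespace name and the limits are placeholders you'd tune for your own team.

```python
# Sketch: cap a team's preview/staging namespace with a ResourceQuota so
# on-demand environments can't blow the budget. Names and limits are placeholders.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-budget"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "16",       # total CPU the team's environments may request
            "requests.memory": "64Gi",  # total memory across all of the team's environments
            "pods": "150",              # rough cap on concurrently running pods
        }
    ),
)
core.create_namespaced_resource_quota(namespace="team-a-previews", body=quota)
```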
So you empower the team to move fast
and remember that your environments, they don't necessarily have to
be too similar to production. Depending on what you're doing,
you can have lighter-weight environments, like when developing or validating the
product. And sometimes you can have functional equivalents,
even for testing, unless you're doing benchmarking or something that depends on
representative infrastructure. So the way I think about this is that
for development we need functional equivalents.
So the queues, the services that
you depend on, they need to be running so that you can point to staging,
point to pre production, and then you can do your dev there and you can
be reasonably confident that things will work as they should. For validation,
product people, designers, and QAs need some kind of guarantee that
things will be similar in terms of performance and in terms of
overall functioning as they would be in production. And for those
tasks, these people don't usually put much load into the system,
so you can scale them down and you will still have performance
equivalence. Finally, for testing, infrastructure equivalence
is important so that you catch bugs that you'd only catch
when you have a certain pool of infrastructure running. The important thing here is to
understand that even though you want an equivalent infrastructure,
by equivalent I don't mean equivalent in scale: I'm not saying you're
going to have 1000 instances or the same performance headroom as production.
Many of these bugs can be caught with
much less infrastructure than people usually have.
Another thing I recommend doing is to generate a representative fake data
set for production database schemas and integrate that into your
staging environment deployment pipeline. That way
every new environment gets a fresh set of data, which is super
useful not only for QAs, devs, and other folks to do representative
tests, but also to validate that migrations will work
appropriately, which is something we've seen quite a lot. Many, many people that come to
talk to us at Ergomake have this problem. They have migrations,
and they are not sure that their migrations are not going to destroy some kind
of important data. So they want to have these environments, so that the migration
runs there, and they want to have some representative data there
to have a good guarantee that things are not going to break when those migrations
go to production. And if you want to do that,
it is quite easy to write a small script to read your schemas and integrate
it with libraries like Faker. That way you can
guess what data goes in each column and generate it on the fly by looking at the
schema. So use Faker to replace any PII, any data that's sensitive,
put the generated data there, and then you have a
good guarantee that the data is going to be similar enough to production
to catch bugs with migrations.
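As a minimal sketch of what such a script could look like, assuming Python's faker library and a made-up users table; in practice you'd read the columns from information_schema or your ORM's metadata instead of hard-coding them.

```python
# Sketch: generate representative fake rows from a schema using Faker,
# so staging gets realistic data with no real PII in it.
# Assumes a hypothetical users(name, email, created_at) table.
from faker import Faker

fake = Faker()

# Map column names to generators; a real script would build this map
# from your schema introspection rather than hard-coding it.
generators = {
    "name": fake.name,
    "email": fake.email,
    "created_at": lambda: fake.date_time_this_year().isoformat(),
}


def fake_rows(columns, count=1000):
    """Yield dictionaries of fake values, one per row."""
    for _ in range(count):
        yield {col: generators[col]() for col in columns}


for row in fake_rows(["name", "email", "created_at"], count=3):
    print(row)
```

You'd then load those rows as part of the environment's deployment pipeline, so every fresh environment starts with a clean, representative dataset.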
Now, to solve the late feedback problem, something we've talked about
quite a lot is empowering non-technical users
and shifting left on their contribution. And I
usually like to do that by creating a directory where these people can get links
to all these environments and access them at any time. Usually the
pull request is the best place to have that. So tools like Vercel,
if you're using Next.js, and Ergomake itself can spin
those up for you automatically. So now all of a sudden, instead of those people
having to go to staging, validating things there, and not even knowing what's
there, these people can review code functionally
by clicking a link on a pull request. And just remember, you may
want to put some pages of these previews behind some
type of auth to avoid, for example, features being revealed
to the Internet or crawlers indexing them.
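As one hedged example of what that could look like, assuming the preview app is a Flask service and the credentials come from environment variables (both of those are placeholders, not something Ergomake or Vercel require):

```python
# Sketch: protect every page of a preview environment with HTTP basic auth.
# Assumes a Flask app and PREVIEW_USER / PREVIEW_PASSWORD set in the environment.
import os

from flask import Flask, Response, request

app = Flask(__name__)


@app.before_request
def require_basic_auth():
    auth = request.authorization
    ok = (
        auth is not None
        and auth.username == os.environ.get("PREVIEW_USER")
        and auth.password == os.environ.get("PREVIEW_PASSWORD")
    )
    if not ok:
        # Asking for credentials also keeps crawlers from indexing the preview.
        return Response(
            "Authentication required", 401,
            {"WWW-Authenticate": 'Basic realm="preview"'},
        )


@app.route("/")
def home():
    return "Preview environment"
```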
Finally, another very important thing is to have some kind of
observability set up, so that when product people or designers
or QAs are doing some tests, you get alerts and you
see when something went wrong in the metrics.
So if you just deploy a Prometheus server or some kind of Alertmanager which
sends messages to Slack, that's ideal, because many times,
even though things will look okay to these folks,
there may be an error thrown in the backend that they won't see, and it may be something critical.
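To give a rough idea, here's a tiny sketch that asks Prometheus for recent 5xx errors in the environment and posts to a Slack incoming webhook. The URLs, the query, and the metric name are all placeholders for whatever your own setup exposes.

```python
# Sketch: poll Prometheus for backend errors in a preview environment and
# post to Slack when any show up. URLs, query, and metric are placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.preview.svc:9090"
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
QUERY = 'sum(increase(http_requests_total{status=~"5.."}[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
results = resp.json()["data"]["result"]
errors = float(results[0]["value"][1]) if results else 0.0

if errors > 0:
    requests.post(SLACK_WEBHOOK, json={
        "text": f"{errors:.0f} backend 5xx errors in the preview environment "
                "in the last 5 minutes",
    })
```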
Synthetics and e2e tests are also a great thing to have running against these
environments. In production, you're going to catch errors and get alerts when things go wrong
because there are people using it. But preproduction usually doesn't
get as much load, so you can kind of reproduce these workflows with synthetics tests
running against these environments all the time. And you can also save production
from having to be the target of e2e tests,
because you can run e2e tests separately, isolated, against each of
these environments.
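As a small illustration, a synthetic check can be as simple as a couple of pytest-style tests pointed at whichever environment URL you pass in; the endpoints and the signup payload below are hypothetical.

```python
# Sketch: tiny synthetic tests that can run on a schedule against any
# preproduction environment. TARGET_ENV_URL and the endpoints are placeholders.
import os

import requests

BASE_URL = os.environ.get(
    "TARGET_ENV_URL", "https://preview-my-branch.preview.example.com"
)


def test_homepage_is_up():
    # Cheapest possible signal that the environment is alive at all.
    resp = requests.get(f"{BASE_URL}/", timeout=10)
    assert resp.status_code == 200


def test_signup_flow_responds():
    # Exercise a critical workflow end to end, not just a health endpoint.
    resp = requests.post(
        f"{BASE_URL}/api/signup",
        json={"email": "synthetic@example.com"},
        timeout=10,
    )
    assert resp.status_code in (200, 201)
```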
It's also important to say that you won't be able to catch all possible problems with these environments.
Their main utility is to shift left on as many stages
of the development lifecycle as possible. Therefore,
you should really focus on having a reliable production environment
and have feature flags, to avoid holding things up and not deploying
just because one change cannot ship yet. So you just switch a feature flag
on or off, and then you can continue deploying even if
something escapes this review process, these shift-left validations.
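Here's a minimal sketch of that idea; the flag here is just an environment variable, whereas a real setup would usually use a flag service, but the principle is the same: the code ships while the feature stays switched off.

```python
# Sketch: the simplest possible feature flag, read from the environment.
# The code can be deployed while the unfinished feature stays dark.
import os


def is_enabled(flag_name: str) -> bool:
    """A flag is on when FEATURE_<NAME>=true is set for the environment."""
    return os.environ.get(f"FEATURE_{flag_name.upper()}", "false").lower() == "true"


def render_dashboard() -> str:
    if is_enabled("new_dashboard"):
        return "new dashboard"   # the work-in-progress feature
    return "old dashboard"       # safe fallback that keeps deploys unblocked
```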
So to summarize: use on-demand environments,
not just one. Empower developers to spin up environments,
manage resources on a team basis,
use fresh copies of data for staging environments,
make preproduction environments observable, run periodic tests against
those environments, have different profiles for preproduction
like lightweight and heavyweight, and do not allow
preproduction to become a bottleneck. For Q&A, you can either send me an email
at Lucas Costa at ergomake.com, or you can go to ergomake.dev and you
will find some links there where you can talk to us.
On Twitter, I'm @thewizardlucas, and you can also send me a message there.
Thank you very much. It's been a pleasure.