Conf42 DevOps 2024 - Online

Cocktail of Environments: How to Mix Test and Development Environments and Stay Alive

Abstract

Why keep testing and development environments apart when they can seamlessly coexist? Push the boundaries and dive into an innovative cloud organization, assess its merits, and understand the implications of a unified approach.

Summary

  • Aleksandr (Alex) Tarasov and Dmitry (Dima) Ulianov talk about a cocktail of environments: how to mix dev and test and stay alive. When you have hundreds of microservices, it's hard to keep them stable.
  • A service mesh is a way to test release candidates and developers' branch versions in a call chain. Other solutions include an advanced mock or fake that removes the external service dependency. Both are good and have their own advantages.
  • The main question is who should process a message. We can use the same approach as we did for request routing, but we need to create several subscriptions. To solve these issues, we need a common library that handles context propagation and message-skip logic.
  • Let's switch to data isolation. We cannot use one database for all versions of our services and just let developers share it. Another solution is a separate database per branch. It looks quite good, but there are some potential issues.
  • I think that's all. If you have any questions, you can reach out to us on social networks; we're happy to answer them. Thanks for joining.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you for coming to our talk: Cocktail of Environments, or how to mix dev and test and stay alive. My name is Aleksandr (Alex) Tarasov, you can find me on different social networks, and my handles are strictly consistent. I'm happy to present my co-speaker today, Dmitry (Dima) Ulianov. Hi guys, and welcome to our talk about a cocktail of environments. Let's start from the beginning.

Let's assume that you're working at a young tech company at a very early stage: you have only development and production, and at some point the CDO comes to you asking, what about testing? We don't have it, so let's establish a testing process. There is an obvious solution: just set up a staging or testing environment in between development and production. We call these typical environments. But today we'll present yet another approach and show how to make it better.

Let's set goals first. Our initial goal was to have always-stable testing environments. We expect them to work all the time and to be stable enough. It would be great to have the development environments stable too, because if you want to test a new service that somehow interacts with the front end, you probably want the front end and all dependent services to be stable as well, right?

The second goal is to minimize the gap between developers and QAs, because in my experience, if you have a separate testing environment where only QAs work, developers will not pay enough attention to it. And as you get more and more microservices, the system becomes more fragile. Microservices architecture is designed to be stable, which is funny, because it's not as good as it sounds: when you have hundreds of microservices, it becomes quite tricky and hard to keep everything stable.

Third, we'll try to unlock parallel testing. I mean that if you have two or three QAs, basically two or more, their tests could potentially affect each other, and we would like to solve that somehow. We thought about it and will share our thoughts on this topic. And finally, it would be great to keep everything as simple as possible, but let's see what we end up with.

The main question here is whether to create a new test environment or not. If we think differently, we can imagine some atypical environments and mix the development and staging environments into one, at least physically, for example into one physical Kubernetes cluster. But even with one physical cluster, we have several logical environments: different namespaces, different deployments, Argo rollouts, and so on. If we consider which logical environments we have, we can define them as follows.

First, we have stable-dev, which always contains all the services at the same versions as in production, so there is something to test against. It's our foundation, and all default routes go to it. As I said, it's the foundation for development, for testing, and for staging. If we try to explain what stable-dev means, I can show you this meme: we take the development environment, the testing environment, and the staging environment, mix them, and we get stable-dev. It's like development, but it's the stable part of development.
It could sound a little controversial, but let's try not to create a new environment, at least not a new physical one, and instead put all these components inside a single one. Because of that, we have different logical environments. For example, we definitely have branch-dev, the second part: every developer can test their own feature branches on the same cluster while they develop a feature. We also have release-candidate-dev, because every new release we deploy should be presented as a candidate first; we need to evaluate and assess it before we can go further. To implement this, we need to address several issues. Dima, could you please tell us more about what these issues are?

Yeah, we thought about this a lot. Initially it was mind-blowing how we could mix different logical environments in one physical infrastructure, and finally we defined the following groups of issues. First is how to route traffic between microservices. That's the obvious thing: if you want to pass test traffic to a feature branch of some specific service, you need a kind of service mesh. The second topic is event routing. You are not always passing real-time traffic; sometimes it's background events generated by cron jobs or whatever, and you need to somehow pass events to a specific version of your service. And the third topic is data isolation. If you want to ensure that your parallel tests, our third goal, do not affect each other, you need to isolate the data somehow.

Yeah, let's talk about all these things and start with the easiest part, the service mesh. In fact, we want to test our release candidates and developers' branch versions in a call chain, and to implement that we need to do service injection. Imagine we have a payment service, a front-end application, and an Nginx, and I as a developer created a new branch, feature MP-101, and deployed it to the dev cluster with a GitLab pipeline. Now I want to test my code, and to do it I need to route my requests to my concrete version of the payment service, and from the payment service I want to call the other services as usual; I need to do some integration testing here.

To implement this, we can use a special header, x-service-route, that consists of the name of the desired service and the reference name, MP-101 here. It's a reference name our system is aware of. If we want to test the front-end application, it's easier to use a cookie with the same name, and Nginx can unpack it for us. If we want to test more than one branch, we can expand our specification and include several services with several reference names. And release candidate testing, which is crucial for QA engineers, works the same way, using a stable reference name called rc for the candidate. We can implement this in different ways. Which ways would you suggest?

There are a lot of ways to do it. Nowadays there are a few implementations of the service mesh that work out of the box, like Istio or Linkerd. Another approach is client-side balancing, but we decided to go with Istio. Yeah, it's a mature service mesh solution, and we can define an Istio VirtualService and deploy it with every stable version via our common Helm chart.
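To make the routing spec concrete, here is a minimal Go sketch of the application side of such a scheme. The header name x-service-route comes from the talk; the exact "service=ref" wire format, the cookie fallback, and all function names are our own illustrative assumptions, not necessarily what the speakers built:

```go
package routing

import (
	"context"
	"net/http"
	"strings"
)

// RouteHeader is the routing header from the talk; the comma-separated
// "service=ref" format below is an illustrative assumption.
const RouteHeader = "x-service-route"

type ctxKey struct{}

// ParseSpec turns "payment-service=MP-101,order-service=MP-102"
// into a map of service name -> reference name.
func ParseSpec(spec string) map[string]string {
	routes := map[string]string{}
	for _, pair := range strings.Split(spec, ",") {
		if name, ref, ok := strings.Cut(pair, "="); ok {
			routes[strings.TrimSpace(name)] = strings.TrimSpace(ref)
		}
	}
	return routes
}

// Middleware stores the incoming routing spec (header, or cookie for
// front-end traffic) in the request context for later propagation.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		spec := r.Header.Get(RouteHeader)
		if spec == "" {
			if c, err := r.Cookie(RouteHeader); err == nil {
				spec = c.Value
			}
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, spec)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Propagate copies the spec onto an outgoing request so the next hop's
// VirtualService can match it and route to the right branch or RC.
func Propagate(ctx context.Context, out *http.Request) {
	if spec, _ := ctx.Value(ctxKey{}).(string); spec != "" {
		out.Header.Set(RouteHeader, spec)
	}
}
```

The mesh itself does the actual routing; the only job of the services is to carry the header unchanged through every hop of the call chain.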
But there is a problem: we cannot do the same for all our branches and release candidates, because those live in separate Helm charts, so we cannot create and manage one CRD from all these deployments. To solve this, we can use the virtual-service-merge operator, which can patch the target VirtualService with new routes. That way we can deploy this CRD with every branch and with release candidates, and it looks like we've solved the issue. Do we have any other issues with request routing?

Looks quite elegant, thanks Aleksandr. Yes, I think there is still one open question. For our own services everything is quite clear, but at the same time we have webhooks, when our service is called by some third-party external service, and this service is obviously not aware of our internal infrastructure, our releases, service names, versions, and so on. So what do we do here?

That's a good question, and we have several solutions. The first one is quite simple: we can just go to the third-party system and change our webhook URL. But first, that's not very convenient, and second, it can affect our stable version and our testing process, which is not good. So let's consider better solutions. The second one is to create an advanced mock, or fake, and get rid of the external service dependency altogether. Because we implement this fake ourselves, we can put any logic we want in there, and we can understand from the request who called the fake and which particular version of the service to call back. And the third solution is to use a smart proxy with a correlation id for matching requests going to and coming from the external service. In this case we do not call the external service directly; we go through the smart proxy, and using a database or an in-memory key-value storage we can correlate our requests and route the incoming request back into our system, to the desired version, whether a release candidate or a branch version. Dima, what are your thoughts on these solutions?

Both are good and have their own advantages. The third solution, which we see now, still communicates with the external service, so it's good for end-to-end testing. And the smart proxy is quite a simple thing: in our case we use in-memory storage, and the correlation id we extract is basically just an object id in our system; it could be an order id, a payment id, whatever. A very simple, elegant, and effective solution. The second solution, with the fake service, has the advantage that it is independent of the external service. So depending on the use case, and maybe on the test scenario, you can choose either of these solutions; both work for their cases. The fake will help us, for example, when the external service is currently down, and your tests just become more stable and reliable, which is good too.
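As a rough illustration of the smart-proxy idea, here is a minimal Go sketch. The in-memory correlation store mirrors what the speakers describe; the handler shape and the order_id query parameter are hypothetical:

```go
package webhookproxy

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

// routes maps a correlation id (an object id such as an order or
// payment id) to the x-service-route value that initiated the call.
var routes sync.Map

// RememberRoute is called on the outgoing path, before the external
// service is contacted, so the eventual webhook can be routed back.
func RememberRoute(correlationID, routeSpec string) {
	routes.Store(correlationID, routeSpec)
}

// WebhookHandler restores the routing header on an incoming webhook
// and forwards it into the mesh; "order_id" is a hypothetical field
// standing in for whatever object id the external payload echoes back.
func WebhookHandler(mesh *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(mesh)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if spec, ok := routes.Load(r.URL.Query().Get("order_id")); ok {
			r.Header.Set("x-service-route", spec.(string))
		}
		proxy.ServeHTTP(w, r)
	})
}
```

With the header restored, the webhook re-enters the mesh like any internal request and lands on the branch or release candidate that triggered it.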
Yeah, good. Looks like we've solved our first bunch of issues with routing, and we can move on to event routing. Why do we need to route events? We have routing for real-time traffic, but for events it's less trivial, I would say, because we have more scenarios here. For example, as a developer I'm creating a new version of my microservice, which is, let's say, a consumer. In this case I need to somehow pass a specific event generated by the producer to my concrete new version of the microservice, and how to do that is not really clear. At the same time we have the opposite scenario, when we need to test the producer: we generate an event on the new version of the producer and handle it on the stable version of the consumer. So it's a matrix of options; let's see what we can do here.

The main question here is who should process a message. Say you have one topic and one subscription, and several consumers running different versions of the same service. We can use the same approach as we did for request routing: event routing. It means we can use a discriminator, a built-in attribute of our message, and pass the x-service-route in it, just as we did for request routing. Based on this information, a consumer can process the message or skip it.

To implement this, we can still use one topic for all messages, but we need to create several subscriptions: one subscription for the release candidate, for example, and one subscription for all branches. But if we use a single subscription for all branches, our developers can interfere with each other, and that is not good. The better solution is to create a separate subscription per branch; that way the messages do not interfere with each other.

To implement this part, we need to do the following: first, create static subscriptions for release candidates; then, create dynamic subscriptions for branches; and finally, build a common library, which we will talk about later. Let's start with the static subscriptions for release candidates. That's quite easy: we have a service catalog, which is a state machine with a bunch of modules. We just change our pub/sub model, rerun the state machine for all the services, and get custom subscriptions for the release candidates. The dynamic subscriptions are more complicated, because in this case we need one more service. We called it Env Hub, and we register every branch with it. When I as a developer create a new branch, I register my environment on the first pipeline run, the pub/sub model is reapplied, and the custom subscription is created. To clean up these resources we do the same in reverse: we deregister the environment on branch deletion, reapply the pub/sub model once again, and the custom subscription is gone. It's a very clear and straightforward process because it lives in our pipeline: the provision-resources job creates these subscriptions for us, the deprovision job cleans them up, and eventually the subscriptions get cleaned up anyway.
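The per-branch provisioning step could look roughly like this; a hedged sketch using the Google Cloud Pub/Sub Go client, where the "topic-branch" naming convention and the expiration safety net are our assumptions, not necessarily what the speakers' state machine does:

```go
package provision

import (
	"context"
	"fmt"
	"strings"
	"time"

	"cloud.google.com/go/pubsub"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// EnsureBranchSubscription creates the dynamic, per-branch subscription
// on the first pipeline run. It tolerates AlreadyExists so repeated
// pipeline runs stay idempotent.
func EnsureBranchSubscription(ctx context.Context, c *pubsub.Client, topicID, branch string) error {
	subID := fmt.Sprintf("%s-%s", topicID, strings.ToLower(branch)) // e.g. "payments-mp-101"
	_, err := c.CreateSubscription(ctx, subID, pubsub.SubscriptionConfig{
		Topic: c.Topic(topicID),
		// Expire idle branch subscriptions as a safety net, on top of
		// the explicit cleanup that runs on branch deletion.
		ExpirationPolicy: 24 * time.Hour,
	})
	if status.Code(err) == codes.AlreadyExists {
		return nil
	}
	return err
}
```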
But I think we still have one more issue. Dima, could you please remind us what we forgot here? Yeah, as you said, the first question is who should accept a message. The second question is how to avoid handling the same message in all of my releases. And there is another thing I'm thinking about: what if I'm generating an event routed to an environment that doesn't exist? You mentioned MP-101 previously; let's say I'm generating an event with the routing key MP-102. How do we make sure this event is not skipped? Because if we skip events by design, our data will eventually become inconsistent, and that's not very nice. Let's see what we can do here.

Yeah, to solve these issues we need a common library that does context propagation and message-skip logic for us, and this library should be used by every microservice in our system. The common library does the following. First, we grab the x-service-route from the server context and put it into the attributes of the pub/sub message. Then we send the message, and all consumers receive it, but before we deserialize it, we need to decide whether to skip or process it.

And there is a tricky part. For example, if I pass MP-101 as the desired reference name, then inside the library the release candidate can say, okay, this is definitely not for me; payment service MP-146 can say, it's not for me either; and payment service MP-101 says, okay, it's for me, because that's my reference name. But what to do with the stable release, should it process this message or not? Imagine that I pass, as Dima said, MP-102: who should process this message? It looks like the stable one should, but the stable one has to know that we have no MP-102 version of the payment service. This leads us to make our library aware of all the other versions, so we need some kind of real-time configuration: we fetch the current state of all the versions of our services from the environment hub service, and that solves the issue.

Finally, we need to put the x-service-route back into the client context. We do that because we want to follow the whole call chain, and this call chain includes not only our HTTP or gRPC requests but also events. So our call chain can consist of real-time requests and background messages. Looks like we solved the issue, Dima, right?

Yes, exactly, it looks like we've really solved it. Basically, what we have here is an extension of the service mesh logic to events. And one note: this solution assumes that our common library, and thus our service, is aware of its own version, so it can understand that certain messages have to be handled by this particular service. Quite a nice solution.
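Here is a minimal sketch of that skip decision in Go, assuming the same "service=ref" attribute format as the earlier routing sketch; the EnvHub interface and all names are hypothetical stand-ins for the real-time configuration the speakers describe:

```go
package eventlib

import (
	"strings"

	"cloud.google.com/go/pubsub"
)

// EnvHub is a hypothetical client for the environment hub that knows
// which branch/RC versions of each service currently exist.
type EnvHub interface {
	Exists(service, ref string) bool
}

// Self identifies the running consumer: its service name and its own
// reference name ("stable", "rc", or a branch like "mp-101").
type Self struct {
	Service, Ref string
}

// parseRoute mirrors the header parser from the earlier routing sketch.
func parseRoute(spec string) map[string]string {
	routes := map[string]string{}
	for _, pair := range strings.Split(spec, ",") {
		if name, ref, ok := strings.Cut(pair, "="); ok {
			routes[name] = ref
		}
	}
	return routes
}

// ShouldProcess decides, before deserialization, whether this replica
// handles the message or acks it untouched and moves on.
func ShouldProcess(m *pubsub.Message, self Self, hub EnvHub) bool {
	route := parseRoute(m.Attributes["x-service-route"])
	ref, addressed := route[self.Service]
	switch {
	case !addressed:
		// No override for this service: only stable handles it.
		return self.Ref == "stable"
	case ref == self.Ref:
		// Explicitly addressed to this branch or release candidate.
		return true
	case self.Ref == "stable" && !hub.Exists(self.Service, ref):
		// Routed to a version that no longer exists (the MP-102 case):
		// stable picks it up so no event is lost.
		return true
	default:
		return false
	}
}
```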
Let's switch to data isolation, the last one. Okay, why do we need to care about data isolation? Why can't we use, for example, one database for all the versions of our services and just let developers share it? Why not? It's a good question.

Initially we defined the goal of enabling real parallel testing, and that assumes my database should not be touched by anyone who is developing or testing something in parallel. I want isolated data, and I want to be the only one able to break my service. So to achieve that we probably need isolated databases. For example, I as a developer can write a migration that breaks the shared database; that's not good, all our tests will fail. So what can we do about it?

We can use the same approach that we used for subscriptions. We can, for example, create a separate logical database for all branches every night. That's the easy approach, and it works well even with large amounts of data in the databases, but it has the problem that our developers could interfere with each other once again. The other solution is a separate database per branch, created, just like our subscriptions, on the first branch deployment.

To make it work, we have nightly jobs that export all the databases from our stable-dev environment, the stable versions of the data, to GCS. Then we import this data with our terraform DB module when we need it, on branch creation, and we can easily create a new logical database from these dumps. Dima, do you see any issues with this approach?

It looks quite good, but obviously there are some potential issues, for example when my database is pretty big. What else? Yeah, I think that if we use this approach for things like Redis caching, it's okay: if we lose the cache, we can at least refill it for a particular branch. But if we use it for databases, it can lead to incomplete data. For example, if you have a payment service and an order service, and I deploy a new payment service, there can be a situation where an order has no payment, and we need to be aware of that. Dima, what can we do about it?

I think nothing special. After discussions with the engineering teams, we decided to accept this risk, because this is quite typical for microservices: services have to be ready for some data skew and some data inconsistency. In this case, I would say, we couldn't find any good way to keep the data strictly consistent, so finally we decided to accept that.
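The talk restores nightly GCS dumps via a terraform module; purely as an illustrative shortcut, and assuming Postgres, the same "logical database per branch" shape can be sketched with a template clone:

```go
package branchdb

import (
	"database/sql"
	"fmt"
	"strings"

	_ "github.com/lib/pq" // Postgres driver; the talk's actual stack may differ
)

// CreateBranchDB clones a per-branch logical database from the stable
// one, e.g. payment_mp_101 from payment_stable. This TEMPLATE clone is
// an assumption standing in for the speakers' dump-and-import flow.
func CreateBranchDB(db *sql.DB, service, branch string) error {
	name := fmt.Sprintf("%s_%s", service,
		strings.ReplaceAll(strings.ToLower(branch), "-", "_"))
	// CREATE DATABASE cannot run inside a transaction, and TEMPLATE
	// requires no active connections to the source database.
	_, err := db.Exec(fmt.Sprintf("CREATE DATABASE %q TEMPLATE %q", name, service+"_stable"))
	return err
}
```

Either way, the point is the same: branch databases are cheap, disposable copies of stable-dev data, created and destroyed by the pipeline.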
Yeah, and it looks like we've solved all our problems: we implemented request routing, event routing, and data isolation, and now we can get to a more general part, which I call ephemeral environments. Dima, do you have any clue what ephemeral environments are?

If we're talking about testing one single service, that's clear; let's say it's isolated testing of one feature. But what do we do if we are changing, adding, or replacing multiple parts of our system at once, and we want to test how all these parts work together?

Yeah, that's the real-life example, and in that case we can use ephemeral environments. For example, we can create an MP-101 environment. As you can see here, our request comes to stable-dev first, then goes to payment service MP-101 as an HTTP request; then we publish a message to the common payment topic, and the message is read and processed through its specific subscription by order service MP-101, which has its own logical database. Then our request comes back to stable-dev once again, because we don't have an order allocation service MP-101, and that service uses the stable version of the database. So we can test any scenario we want, even really complex things.

As for the types of ephemeral environments: as we discussed, it could be a single service that I as a developer deployed from my feature branch; it could be several services I deployed; or it could be custom environments, for example for a squad, or for a domain such as warehouse. And if we look at the bigger picture, stable-dev is the boundary, and within it we can create any ephemeral environment, mix them with each other, and put into an ephemeral environment any services we want, with any versions we want to test. That gives us great flexibility. So I think that's the end of the technical part, and we can reflect a little on this hybrid solution. Let's start with the benefits, of course. Why is it good?

Yeah, it looks like with this solution, if we think of the x-service-route key as an abstract thing, it becomes possible to have an endless number of environments. But let's switch to conclusions: what do we have?

First, we've solved the natural issue that comes up when you need to enable parallel testing, even if you have a separate environment for it. Let's say we have an isolated staging environment, a separate cluster. Still, what do you do when you want to test things in parallel, running integration tests that can affect each other's results? Eventually you will probably arrive at something more or less similar to what we discussed here. Honestly, when I thought about that, it was one of the triggers for going with this approach.

The second important point is that, as you've seen, this solution requires a lot of dynamic parts in the infrastructure. We dynamically create environments, and we create subscriptions for feature branches. This assumes, first, that you are using some managed solution for it: it could be in-house, it could be some cloud, but you will definitely need the ability to create resources dynamically. And second, you will need an internal developer portal to manage all this stuff, because it becomes quite complicated; comprehensive tooling helps to reduce the cognitive load here.

The third point is about infrastructure cost. This solution will definitely help to save maybe ten or twenty percent of infrastructure costs, because you don't have a separate cluster, so you save on machines: fewer deployments, fewer machines, less operational cost, fewer resources. And on top of that, your environment configuration becomes even more consistent.

Yeah, and the platform team is happy about it, right? You need a smaller number of physical clusters, for one. But what's good for the platform team may not be so good for your test engineers, and when I made a post on LinkedIn about this talk, one of my colleagues said that he had spent a lot of money on antidepressants. Yeah, because like every solution, this one has its own drawbacks, and the main drawback is the high cognitive load on developers and QA engineers. You have to hire more qualified people, people who can keep all these schemas in their minds and who are comfortable with distributed traces, for example, to find out why a request went to some other service, why something doesn't work, and to troubleshoot it. And of course you need to invest time into tooling. But as Dima said, I think you would invest that time anyway, even if you had separate environments, for modern microservice development and testing.

And maybe a third drawback, unique and specific to this solution: we said that we have data isolation, but in fact it's not complete isolation. Sometimes you can still interfere with each other, especially if you don't isolate things like GCS; you might decide, okay, we will not isolate data in our GCS buckets, let's use one bucket for everyone. So yeah, the isolation is not complete, and by the way, it requires a very strong team to handle it.

I think that's all. If you have any questions, you can reach out to us on social networks, and we will be
happy to answer your questions. Thank you and goodbye. Thanks for joining.

Aleksandr Tarasov

Principal Platform Engineer @ Cenomi


Dmitry Ulianov

Director of Platform & Infrastructure @ Cenomi


