Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, thank you for coming to our talk, "Cocktail of Environments: How to Mix Dev and Test and Stay Alive". My name is Aleksandr Tarasov, Alex, and you can find me on different social networks; my handles are strictly consistent. I'm happy to present my co-speaker today, Dmitry Ulianov, Dima.
Hi guys, and welcome to our talk about a cocktail of environments. Let's start from the beginning.
Definitely. Let's assume that you're working in a young tech company at a very early stage: you have only development and production. At some point the CDO comes to you and asks: what about testing? We don't have it, so let's establish a testing process. There is an obvious solution: just set up a staging or testing environment between development and production. We call these typical environments. But today we'll present yet another approach and show how to make it better.
Let's set goals first. Our initial goal was to have an always-stable testing environment: we expect it to work all the time and to be stable enough. It would be great to have development environments stable too, because if you want to test a new service which somehow interacts with the front end, you probably want the front end and all dependent services to be stable as well, right? The second thing is that we will try to minimize the gap between developers and QAs, because in my experience, if you have separate testing environments where only QAs work, developers will not pay enough attention to those environments. And as you get more and more microservices, everything becomes more fragile, despite the fact that microservices architecture is designed to be resilient. That's funny, but actually it's not as good as it sounds: if you have hundreds of microservices, it becomes quite tricky to keep them all stable. So this is the second goal: to minimize the gap between developers and QAs.
Third, we'll try to unlock parallel testing. I mean that if you have two or more QAs, their tests could potentially affect each other, and we would like to solve that somehow. We thought about it and will share our thoughts on this topic. And finally, it would be great to keep everything as simple as possible, but let's see what we end up with.
Yeah, the main question here is whether to create a new test environment or not. If we think differently here, we can imagine some atypical environment that mixes development and staging into one: physically it could be, for example, a single Kubernetes cluster. But even with one physical cluster, we have several logical environments. It means that we have
different namespaces, different deployments, Argo rollouts, and so on. And if we consider which logical environments we have, we can define them as follows. First, we have stable-dev, which always contains all the services with the same versions as in production, so there is always something stable to test against. It's our foundation, and all default routes go to it: the foundation for development, for testing, and for staging. If we try to explain what stable-dev means, I can show you this meme: we take the development environment, mix in the testing and staging environments, and we get stable-dev. It's like development, but it's the stable part of development. It may sound a little controversial, but let's try not to create a new environment, at least not a new physical one, and instead put all these components inside one. Because of that, we have different logical environments.
For example, we definitely have branch-dev, and branch-dev is the second part. It means that every developer can test their own feature branches on the same cluster while they develop a feature. We also have release-candidate-dev, because every new release we deploy should be presented as a candidate first: we need to evaluate and assess it before we can go further.
To implement this, we need to address several issues. Dima, could you please tell us more about what these issues are?
Yeah, we thought about that a lot. Initially it was mind-blowing how we could mix different logical environments in one physical infrastructure, and finally we defined the following groups of issues. First: how to route traffic between microservices. That's the obvious one; if you want to pass test traffic to a feature branch of some specific service, you need a kind of service mesh. The second topic is event routing. You're not passing real-time traffic all the time; sometimes it's background events generated by cron jobs or whatever, and you somehow need to pass those events to a specific version of your service. And the third topic is data isolation. If you want to ensure that your parallel tests (our third goal) do not affect each other, you need to isolate the data somehow.
Yeah, let's talk about all these things and start with the easiest part, the service mesh. In fact, we want to test our release candidates and developers' branch versions inside a call chain, and to implement that we need service injection. Imagine we have a payment service, a front-end application, and some Nginx, and I as a developer created a new branch, feature/MP-101, and deployed it to the dev cluster with a GitLab pipeline. Now I want to test my code, and to do that I need to route my requests to my concrete version of the payment service, and from the payment service I want to call the other services as usual: I need to do some integration testing here. To implement it, we can use a special header like x-service-route that consists of the name of the desired service and a reference name, MP-101 here, that our system is aware of. If we want to test the front-end application, it's easier to use a cookie with the same name, and Nginx can unpack it for us. If we want to test more than one branch, we can expand our specification and include several services with several reference names. And for release candidate testing, which is crucial for QA engineers, they can do it the same way by using a stable reference name called RC for the candidate.
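A minimal sketch of this header-based routing decision. The talk does not spell out the exact header encoding, so the service=ref format, the header name casing, and the subset naming below are assumptions:

```python
# Hypothetical routing header, e.g. "payment-service=MP-101,order-service=RC".
HEADER = "x-service-route"

def parse_route_header(value: str) -> dict[str, str]:
    """Parse 'payment-service=MP-101,order-service=RC' into a dict."""
    routes = {}
    for pair in value.split(","):
        if "=" in pair:
            service, ref = pair.split("=", 1)
            routes[service.strip()] = ref.strip()
    return routes

def upstream_for(service: str, headers: dict[str, str]) -> str:
    """Pick the version of a service a request should be routed to.

    Falls back to 'stable' when the header does not mention the service,
    so by default every hop in the call chain lands in stable-dev.
    """
    routes = parse_route_header(headers.get(HEADER, ""))
    ref = routes.get(service, "stable")
    return f"{service}-{ref.lower()}"

# A request carrying the header reaches the branch version of the payment
# service, while every other service stays on its stable-dev version.
headers = {"x-service-route": "payment-service=MP-101"}
```

With these headers, upstream_for("payment-service", headers) resolves to the MP-101 subset, while any service not named in the header resolves to its stable subset.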
And we can implement it in different ways. Which ways would you suggest?
There are a lot of ways to do it. Nowadays there are a few implementations of a service mesh that work out of the box, like Istio or Linkerd. There is another approach, client-side balancing, but we decided to go with Istio.
Yeah, it's a mature service mesh solution, and we can define an Istio VirtualService and deploy it with every stable version via our common Helm chart. But there is a problem: we cannot do the same for all our branches and release candidates, because those are separate Helm charts, so we cannot create and manage one CRD from all these deployments. To solve this issue we can use a virtual-service-merge operator that can patch the target VirtualService with new routes. In that case we can deploy such a CRD with every branch and release candidate. So it looks like we solved this issue. Do we have any other issues with request routing?
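The merge idea can be sketched like this. The dictionaries loosely mirror Istio's VirtualService HTTP routes, but the field subset shown here and the operator's real CRD schema are assumptions, not the actual API:

```python
# Sketch of what a virtual-service-merge operator does: take the routes
# declared in a per-branch CRD and prepend them to the stable
# VirtualService. Istio evaluates HTTP routes in order, so the more
# specific header-match routes must come before the catch-all route.

def merge_routes(stable_vs: dict, branch_routes: list[dict]) -> dict:
    """Return a copy of the VirtualService with branch routes prepended."""
    return {**stable_vs, "http": branch_routes + stable_vs["http"]}

stable_vs = {
    "host": "payment-service",
    "http": [
        # Default catch-all route deployed with the common Helm chart.
        {"route": [{"destination": {"subset": "stable"}}]},
    ],
}

# Route fragment deployed alongside the feature branch.
branch_routes = [
    {
        "match": [{"headers": {"x-service-route": {"exact": "payment-service=MP-101"}}}],
        "route": [{"destination": {"subset": "mp-101"}}],
    },
]

merged = merge_routes(stable_vs, branch_routes)
```

The point of the ordering is that a request with the matching header hits the branch subset, and everything else falls through to stable.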
Looks quite elegant, thanks, Aleksandr. Yes, but I think there is still one open question. For services everything is quite clear, but at the same time we have webhooks, when our service is called by some third-party external service, and that service obviously is not aware of our internal infrastructure: our releases, service names, versions and so on. So what do we do here?
That's a good question, and we have several solutions. The first one is quite simple: we can just go to the third-party system and change our webhook URL. But first, that's not very convenient, and second, we can affect our stable version and our testing process, which is not good. So let's consider better solutions. The second one is to create an advanced mock, or fake, and get rid of the external-service dependency altogether. Because we implement this fake on our own, we can put any logic we want into it, and we can understand from the request who called the fake and which particular version of the service we need to call back. And the third solution is to use a smart proxy and a correlation ID for matching requests to and from the external service. In this case we do not call the external service directly: we use the smart proxy, and using a database or an in-memory key-value storage we can correlate our requests and route the incoming requests back into our system, to the desired version, a release candidate or a branch version.
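A minimal sketch of that correlation logic, under stated assumptions: the class, the in-memory storage, and the header format are illustrative. The talk only says that an object ID in the system, such as an order or payment ID, is used as the correlation key:

```python
# Sketch of the smart proxy: outgoing requests to the external service
# pass through it, and it remembers which environment (reference name)
# issued each object ID. When the webhook comes back with that ID, the
# proxy restores the routing header so the callback reaches the right
# version instead of always hitting stable.

class SmartProxy:
    def __init__(self):
        # correlation id (order id, payment id, ...) -> reference name
        self._routes: dict[str, str] = {}

    def outgoing(self, correlation_id: str, ref_name: str) -> None:
        """Record who initiated a request to the external service."""
        self._routes[correlation_id] = ref_name

    def incoming(self, correlation_id: str) -> dict[str, str]:
        """Rebuild internal routing headers for a returning webhook."""
        ref = self._routes.get(correlation_id, "stable")
        return {"x-service-route": f"payment-service={ref}"}

proxy = SmartProxy()
proxy.outgoing("order-42", "MP-101")
```

Unknown correlation IDs fall back to stable, which keeps the default behavior unchanged for traffic that did not originate from a branch.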
So Dima, what are your thoughts on these solutions?
Both are good and have their own advantages. The third solution, which we see now, still communicates with the external service, so it's good for end-to-end testing. And the smart proxy is quite a simple thing: in our case we use in-memory storage, and the correlation ID we extract is basically just an object ID in our system. It could be an order ID, a payment ID, whatever; a very simple and effective solution. But the second solution, the fake service, has the advantage of being independent of the external service. So depending on the use case, maybe on the test scenario, you can choose either of these solutions; both will work for their cases. The fake will help us, for example, when the external service is currently down, and your tests just become more stable and reliable, which is good too. Yeah,
good. Looks like we solved our first bunch of issues, with request routing, and we can move on to event routing. So why do we need to route our events? We have routing for traffic, but in the case of events it's less trivial, I would say, because we have more scenarios here. For example, as a developer I'm creating a new version of my microservice, which is, let's say, a consumer. In this case I need to somehow pass a specific event, generated by the producer, to my concrete new version of the microservice. How to do that is not really clear. At the same time we have the opposite option, when we need to test a producer: we generate, let's say, an event on the new version of the producer and handle it on the stable version of the consumer. So it's a matrix of options; let's see what we can do here. Yeah,
the main question here is who should process a message, right? You have, for example, one subscription, one topic, and several consumers: different versions of our service. We can use the same approach as we did for request routing, but for events. We can use a discriminator, a built-in attribute in the message, and pass our x-service-route in it, as we did for request routing. Based on this information we can process or skip messages. To implement it, we can still use one topic for all messages, but we need to create several subscriptions: one subscription for the release candidate, for example, and one subscription for all branches. But if we use one subscription for all branches, our developers can interfere with each other, which is not good. The better solution is to create a separate subscription per branch; in this case the messages do not interfere with each other. To implement this part we need to do the following: first, create static subscriptions for release candidates; then create dynamic subscriptions for branches; and finally we need a common library that we
will talk about later. Let's start with the static subscriptions for release candidates. It's quite easy: we have a service catalog, a state machine with a bunch of models, and we just change our Pub/Sub model, rerun the state machine for all the services, and get custom subscriptions for the release candidates. For branches, with dynamic subscriptions, it's more complicated, because we need one more service for that. We called it Env Hub, and we register every branch in it. So when I as a developer create a new branch, on the first pipeline run I register my environment, then the Pub/Sub model is reapplied, and the custom subscription is created after that. To clean up all these resources we do the same in reverse: we deregister our environment on branch deletion, reapply the Pub/Sub model once again, and our custom subscription is gone. This is a very clear and straightforward process, because it lives in our pipeline: as you can see here, the provision-resources job creates these subscriptions for us, the deprovision job cleans them up, and eventually these subscriptions get cleaned up anyway.
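The registration lifecycle could be sketched as follows. The class and method names, and the subscription naming scheme, are illustrative assumptions; the talk describes the real service only as an environment hub driven by pipeline jobs:

```python
# Sketch of the branch-registration flow: the environment hub keeps the
# set of live branch environments, and reapplying the Pub/Sub model
# derives the desired set of subscriptions (one static RC subscription
# plus one per registered branch) for a topic.

class EnvHub:
    def __init__(self):
        self._branches: set[str] = set()

    def register(self, ref_name: str) -> None:
        """First pipeline run of a branch registers its environment."""
        self._branches.add(ref_name)

    def deregister(self, ref_name: str) -> None:
        """Branch deletion removes the environment again."""
        self._branches.discard(ref_name)

    def apply_pubsub_model(self, topic: str) -> set[str]:
        """Desired subscriptions: one static RC sub plus one per branch."""
        subs = {f"{topic}-rc"}
        subs |= {f"{topic}-{b.lower()}" for b in self._branches}
        return subs

hub = EnvHub()
hub.register("MP-101")
subs = hub.apply_pubsub_model("payment-events")
```

Deregistering the branch and reapplying the model shrinks the desired set back to the static RC subscription, which is exactly the cleanup path described above.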
But I think we have one more issue. Dima, could you please think about what we forgot here?
Yeah. As you said, the first question is who should accept a message. The second question is how to avoid accepting and handling the same message in all my releases, right? And there is another thing I'm thinking about: what if I'm generating an event that is not for any existing environment? Let's say you mentioned MP-101 previously, and I'm generating an event with the routing key MP-102. How do we avoid skipping this event? Because if we skip events by design, our data will eventually become inconsistent, and that's not very nice.
Let's see what we can do here. Yeah, to solve these issues we need a common library that will do context propagation and the message-skip logic for us, and this library should be used by every microservice in our system. The common library does the following. First, we grab our x-service-route from the server context and put it into the attributes of the Pub/Sub message. Then we send the message, and all consumers receive it. But before we deserialize it, we need to decide whether to skip or to process this message, and there is a tricky part here. For example, if I pass MP-101 as the reference name, then inside the library my release candidate can say: okay, this is definitely not for me. Payment service MP-146 can say: not for me either. And payment service MP-101 says: okay, it's for me, because it's my reference name. But what to do with the
stable release? Should it process this message or not? Imagine that I pass, as Dima said, MP-102: who should process this message? It looks like the stable one, but the stable one should be aware that there is no MP-102 version of the payment service. This leads us to make our library aware of all the other versions, so we need some kind of real-time configuration: we fetch the current state of all the versions of our services from the environment hub service, and that solves this issue. And finally, we need to put the x-service-route back into the client context. We do that because we want to follow the call chain, and this call chain includes not only our HTTP or gRPC requests but also events: our call chain can consist of real-time requests and background messages. Looks like we solved the issue, Dima, right?
Yes, exactly, it looks like we've really solved the issue. Basically what we have here is an extension of the service-mesh logic to events, right? One note here: this solution assumes that our common library, and thus our service, is aware of its own version, so it can understand that certain messages have to be handled by this service.
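The skip-or-process decision described above could be sketched like this, under stated assumptions: each consumer knows its own reference name, and the stable consumer additionally consults the environment hub's list of existing versions, so an event routed to a non-existent branch (the MP-102 case) falls back to stable instead of being lost. The attribute name and values are illustrative:

```python
# Sketch of the common library's message-skip logic for Pub/Sub consumers.

def should_process(my_ref: str, msg_attrs: dict[str, str],
                   known_refs: set[str]) -> bool:
    """Decide whether this consumer instance handles the message.

    my_ref     -- reference name of this deployment ("stable", "rc",
                  "MP-101", ...)
    msg_attrs  -- Pub/Sub message attributes, populated from the server
                  context by the common library before publishing
    known_refs -- versions currently registered in the environment hub
    """
    target = msg_attrs.get("x-service-route", "stable")
    if my_ref == target:
        return True   # the message was routed explicitly to us
    if my_ref == "stable" and target not in known_refs:
        return True   # unknown target: stable picks it up, no data loss
    return False

# Versions the environment hub currently knows about.
known = {"stable", "rc", "MP-101", "MP-146"}
```

So a message tagged MP-101 is processed only by the MP-101 deployment, while a message tagged MP-102 (which does not exist in known_refs) is processed by stable, keeping the data consistent.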
Quite a nice solution. Let's switch to data isolation.
Yeah, the last one. Okay, why do we need to care about data isolation? Why can't we use, for example, one logical database for all the versions of our services and let developers share it?
It's a good question. Initially we defined the goal of enabling real parallel testing, and it assumes that my database should not be touched by anyone who is developing or testing something in parallel. I want to have isolated data and to break my service by myself only. So to achieve that we need isolated databases.
Probably, yeah. For example, I as a developer can write a migration that will break our database. That's not good: all our tests will fail. So what can we do about it?
We can use the same approach that we used for subscriptions. We can create, for example, every night, a separate logical database for all branches. That's an easy approach, and it works well even with a large amount of data in the databases, but it has the problem that our developers could once again interfere with each other. So there is another solution: a separate database per branch, created, like our subscriptions, on the first branch deployment. For that we have a nightly job that exports all the databases from our stable-dev environment, the stable versions of the data, to GCS. Then we import this data with our Terraform DB model when we need it, on branch creation, and we can easily create a new logical database from these dumps. Dima, do you see any issues with this approach?
Looks quite good, but obviously there are some potential issues, for example if my database is pretty big.
What else? Yeah, I think that if we use this approach for things like Redis caches, it's okay: if we lose the cache, we can at least refill it for a particular branch. But if we use it for databases, it can lead to incomplete data. For example, say you have a payment service and an order service, and I deploy a new payment service: there will be situations where we have an order that has no payment, and we need to be aware of that. Dima, what can we do about it?
I think nothing special. After discussions with the engineering teams, we decided to accept this risk, because it's quite a typical thing for microservices: services have to be ready for some data skew and some data inconsistency. In this case, I would say, we couldn't find any good solution for keeping the data strictly consistent, so finally we decided to accept that.
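The per-branch provisioning described above could be sketched as follows, under stated assumptions: a nightly job dumps each stable-dev logical database to a GCS path, and on the first branch deployment that dump is restored into a fresh branch-scoped logical database. All names and paths are illustrative; the speakers drive the real flow through a Terraform DB model:

```python
# Sketch of nightly export plus on-demand per-branch restore.

def nightly_export(databases: list[str], date: str) -> dict[str, str]:
    """Map each stable-dev logical database to its nightly GCS dump path."""
    return {db: f"gs://stable-dev-dumps/{date}/{db}.sql.gz"
            for db in databases}

def branch_database(db: str, ref_name: str, dumps: dict[str, str]) -> dict:
    """Describe the restore of one branch database from the latest dump.

    The branch gets its own logical database named after the reference
    name, seeded with last night's stable data.
    """
    return {
        "name": f"{db}_{ref_name.lower().replace('-', '_')}",
        "restore_from": dumps[db],
    }

dumps = nightly_export(["payment", "order"], "2024-05-01")
branch_db = branch_database("payment", "MP-101", dumps)
```

Because the dumps are taken at different moments per database, restored branch data can show exactly the cross-service skew the speakers chose to accept, such as an order without its payment.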
Yeah, and it looks like we solved all our problems: we implemented request routing, event routing, and data isolation, and now we can move to a more general part, which I call ephemeral environments. Dima, do you have any clue what ephemeral environments are?
If we are talking about testing one single service, that's clear; let's say it's the isolated testing of one feature. But what do we do if we are changing, adding, or replacing multiple parts of our system at once and want to test how all these parts work together?
Yeah, that's a real-life example, and in that case we can use ephemeral environments. For example, we can create some MP-101 environment,
and as you can see here, our request comes to stable-dev first, then goes to payment service MP-101 as an HTTP request. Then we publish a message to the common payment topic, and it is read and processed through a specific subscription by order service MP-101, which has its own logical database. Then our request comes back to stable-dev again, because we don't have an MP-101 version of the order allocation service, and it uses the stable version of the database. So we can test any scenario we want, even quite complex ones. As for the types of ephemeral environments: as we discussed, it could be just one service that I as a developer deployed from my feature branch; it could be several services; and it could be custom environments, for example for a squad or for a domain like warehouse. And if we look at the bigger picture, we can see that stable-dev forms the boundary, and we can create any ephemeral environment, mix them with each other, and put into an ephemeral environment any services with any versions we want to test. It gives us great flexibility here.
So I think that's the end, and we can reflect a little on this hybrid solution, starting with the benefits, of course. Why is it good?
Yeah, with this solution, if we think about the x-service-route key as an abstract thing, it becomes the possibility to have an endless number of environments. But let's switch to conclusions: what do we have, finally? We've solved the natural issue that comes up when you need to enable parallel testing, even if you have a separate environment for that. Let's say we have an isolated staging environment, a separate cluster. Anyway, what do you do if you want to test things in parallel, you are running integration tests, and these tests can affect each other's results? Eventually you will probably arrive at something more or less similar to what we discussed just now. Honestly, when I thought about that, it was one of the triggers for going with this approach.
The second important point is that, as you've seen, this solution requires a lot of dynamic parts in the infrastructure. We are dynamically creating environments, and not only environments but also subscriptions for feature branches. First, this assumes that you are using some managed solution for that; it could be in-house, it could be some cloud, but you will definitely need the ability to create resources dynamically. And second, you will need some internal developer portal to be able to manage all this stuff, because it becomes quite complicated, and comprehensive tooling helps to reduce the cognitive load here.
And the third point is about infrastructure cost. This solution will definitely help to save maybe ten or twenty percent of infrastructure costs, because you don't have a separate cluster to spend money on: you have fewer replicas and obviously fewer machines, so less operational cost and fewer resources. And finally, your environment configuration becomes even more consistent because of that.
Yeah, and the platform team, I think, is happy about it: you need a definitely smaller number of, for example, physical clusters. But what is good for the platform team may be not so good for your test engineers.
When I made a post on LinkedIn about this talk, one of my colleagues said that he had spent a lot of money on antidepressants. Because, like every solution, this one has its own drawbacks, and the main drawback is the high cognitive load on developers and QA engineers. You have to hire more qualified people who can keep all these schemas in their heads and who are comfortable with distributed traces, for example to find out why a request went to the wrong service, why something doesn't work, and to troubleshoot it. And you need, of course, to invest your time into tooling. But as Dima said, you would invest this time anyway if you had separate environments for modern microservice development and testing. And maybe there is a third drawback, unique and specific to this solution: we said that we have data isolation, but in fact it's not complete isolation. Sometimes you can still interfere with each other, especially for the kinds of storage you decide not to isolate: for example, you can say, okay, we will not isolate data in our GCS buckets, let's use one bucket for everyone. So, yeah, it's not complete isolation, and it requires a pretty strong team to handle it, by the way. I think that's all. If you have any questions, you can reach out to us on social networks, and we will be happy to answer them. Thank you and goodbye.
Thanks for joining.