Transcript
This transcript was autogenerated. To make changes, submit a PR.
This talk today is called Declarative Everything: a GitOps and automation based approach to building efficient developer platforms.
Before diving in too deep, let's understand what a
developer platform even is. On the right hand side over here we have production-ish things. On the production side, obviously, there's the true production environment. 100% of the traffic that it receives is usually from external users.
There might also be staging, which might get a
small percentage of the production traffic. And again, all of this depends on
how the organization has implemented various blue
green type of environments. So yeah, the traffic
split depends on how that's configured. On the non production
side of the house, we have various different environments.
Development is usually one of the biggest ones that engineers get to interact
with on a day to day basis. This might be a local development environment on a local workstation, or it might be a remote machine that an engineer normally SSHes into, so on and so forth.
Then there's CI/CD. That environment normally serves two types of usage.
One is from the various authors of all of the features and changes that
are being proposed. The other is from our colleagues who are teammates
who get to review the code and see the outputs of the various
test executions. Then we have our classes of
pre production environments, which aren't really production, but are production
ish in the sense that it might be something for our QA teams
to work off of, or a dev environment, which is an environment
shared by multiple engineers to perform various types
of end to end testing. A user acceptance testing environment
might also be in that bucket, which is a
place where a product manager, for example, might go to make sure the
definition of done as was outlined in whatever product
document has been implemented appropriately.
We talked a little bit about a developer platform.
Now let's talk about the developer workflows, and then we'll try to converge
the both of them together. A dev workflow normally includes the SDLC, and I'll use the terminologies SDLC and software development lifecycle interchangeably; they are essentially the same thing. The SDLC includes two parts: the inner loop and the outer loop. The inner loop is when the engineer is actively
building something, which is the dev happening inside of an IDE.
Some local testing might happen. Then as an engineer,
we might push this branch up somewhere to get it deployed into
some environment where we can do our end to end testing. And out there
we can start to identify some issues. If there are any issues, it goes back into dev, and that loop continues till the engineer is happy with their implementation and it meets the definition of done. Once all of that's good, it goes into code review. Again, if everything looks fine over there, we'll move forward into the CI/CD stage. If it's not, it goes back into dev, back into the inner loop of the SDLC, and changes keep getting implemented, so on and so forth. On the other side, code review is the start of the outer loop. We hit CI/CD after that, and then pre-production, and then prod.
In every single one of these stages, if there's something that goes wrong,
we go back into dev, which is the start of the inner loop of the
SDLC. Otherwise, we just keep going forward into the next stage
till we hit production, which is when our software is going and
serving live humans. On the right hand side,
it's just a different way of looking at the inner and the outer loop of
the SDLC, where the inner loop, as I covered, was the REPL of software development, the read-eval-print loop.
Once things are fine, we end up using git as our context
transfer mechanism. We push a branch for code review, for example,
and then it hits the outer loop of the SDLC,
goes through that entire loop, and then it'll go into some sort of a
pre production environment. Staging might be called a pre
prod environment, or a part of the production infrastructure, depending on
how the organization is usually set up.
Next, I'm an engineer.
I've been given a ticket. I need to ultimately get this change into production.
What are the workflows that happen underneath? I will
get the latest version of the source code, pull main, do a git checkout -b new-user-table if that's the ticket I'm working on, T-1234, and then I pull up my IDE and get my local dev env set up. After that, I might wait for my IDE to index a little bit, and then I start coding. Once some code has gotten
written, I might do some form
of local verification, might push it to an environment
of some sort to be able to perform some level of end to end testing.
And one theme across this today will be that a lot of the content I cover is more applicable to the microservice side of the house, although some of these constructs
should be applicable to monolith architectures as well.
So now I've done some testing, the feature seems
to be working. I'll write some unit and some integration tests, and then once I'm satisfied I've met the definition of done, I'll send it out for a code review. My colleague, my teammate, will read the code. They'll check how CI was running for the tests I just implemented. They'll also make sure that older tests haven't broken as a result of my change. Then a diligent code reviewer might also try to get this into a preview environment of some sort, right, to check how the functionality, et cetera, is working. If everything looks good, approval; if there's feedback, it gets sent back, and again it's the same inner loop of the SDLC till things look good.
After things look good, the engineer can merge the change into main.
As soon as that merge action happens, tests run.
If things are looking good, engineer can move it into pre prod,
do some more testing, things are good, deploy to staging, more
testing, things are good, move it to prod. That's normally
the day to day workflow of getting a change into production.
Now, which parts of this can benefit from
automation? I have it bolded in this image over here. The code and IDE process, right? Checking out the latest branch, or pulling the latest main, getting into a new branch, making sure the IDE is indexed. All the dev environment
stuff is set up. All of my dependencies are there.
That's a bunch of developer toil and friction that can probably benefit from automation. Then testing in an environment: can an engineer automatically get an environment to test in that is just based on whatever branch they are operating on? So those are bits where automation can directly help. For unit and integration testing, I know there's a lot of work happening in the AI realm today, still very early, but maybe computers can also help us write some tests and also help with the code review process.
Usually we all have automated tests running in CI
already, as soon as a certain branch is put up for review or a request to merge to main is opened. Checking the functionality in your preview environment, that's another area where automation can help significantly. And when tests are passing and approval has been given by a teammate, we should automatically be able to merge things into main, run those tests in an automated fashion, automatically provision a pre-production environment, and run tests in there. If things are good, it gets into staging, so on and so forth.
So now we talked about dev platforms,
we talked about developer workflows, how changes get into production.
Now, GitOps. What is GitOps? At its core, you might have noticed we were talking about creating a bunch of infrastructure at various stages as we were trying to get a change into prod. GitOps is just trying to drive a configuration-backed approach to codify various DevOps best practices, which can then be applied to infrastructure automation. Many companies call these golden paths. The ultimate rationale behind all of this is that standardization will fundamentally improve the developer experience, by removing humans from having to make a variety of these almost menial decisions: going and tearing down environments, creating new environments, et cetera. Automation can also help us reduce costs, by killing environments when they are not actively being used, for example. How can this be achieved?
Through some sort of configuration that lives right alongside your source code. Your app logic, your business logic, is living in source; why can't all of the supporting logic codifying the stages of the SDLC also live alongside it?
And ultimately, why this matters: the concept of CI, continuous integration, existed so we could have repeated automation happening every time we merge a new change; tests get run, et cetera, et cetera. It fundamentally helps us improve our development and deployment times. So that's why GitOps is important. Now, we talked
about environments multiple times in the last three,
four odd slides. What is an environment at its core?
An environment needs to have whatever runtime requirements or dependencies an application needs. It could be something as simple as having a Go compiler available, or something that can execute a binary as a process. It might need access to some sort of dev or test data, et cetera. It might need access to some form of downstream services, and it might need upstream services as well if I'm trying to, for example, test an API gateway change and make sure that the website still functions correctly. The environment also needs to be accessible to a human, so that they can either test in it or actually access it to figure out what's going wrong. Depending on the use case,
if it's a production environment, obviously it shouldn't be isolated from
prod. But if it's a non production environment, some level of isolation from
production is important. Depending again on how the environment is being used, it might need some form of source code, build tools, your IDE, and various other developer specific tooling. And depending on when you're using it: again, if it's a
prod environment or a staging environment, that's usually a shared tenancy construct.
But if it's my local development workstation, that's a dedicated tenancy construct.
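To make that list concrete, here is a minimal sketch of what such an environment definition might look like once it is codified as configuration and stored next to the service's source code. The file name, field names, and values below are hypothetical illustrations, not the schema of any particular tool.

```yaml
# environment.yaml -- hypothetical schema, kept alongside the service's source code
name: payments-service-dev
tenancy: dedicated              # dedicated for a dev workstation; shared for staging/prod
runtime:
  language: go                  # e.g. needs a Go compiler, or something to run the binary as a process
  version: "1.22"
data:
  databases:
    - name: payments-db
      engine: mysql
      seed: dev-test-data       # access to some form of dev/test data
services:
  downstream:
    - billing-service           # services this change will call
  upstream:
    - api-gateway               # services that call into this change, e.g. gateway-to-website flows
access:
  humans: true                  # a person can reach it to test or debug
  isolated_from_production: true
tooling:
  - source-code
  - build-tools
  - ide-attach                  # let an IDE or debugger hook into the environment
```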
So ultimately, we'd like to get to a world where whatever configuration we are storing alongside our source code can give us environments that are well tuned and essentially configured appropriately for whatever stage of the SDLC we are currently in. Testing, debugging, et cetera, all of these are angles at verifying that the software we are now proposing or making a change to is fundamentally not causing regressions anywhere, and that it's meeting the appropriate definition of done. Testing, at its core, exists for just one core reason: acceptance criteria. Right?
The product document that I received as an engineer has some
form of features and functionality that it aims to establish.
Has my implementation achieved them all? And as I mentioned, I'm primarily going to focus more on the microservices side of the house. Again, this is also applicable to monoliths, but I'm putting more of an emphasis on services, and not libraries and modules, over here. The reason for that is oftentimes a
database doesn't live on the same machine as
where our application is running. So a network call is happening
to either talk to a database or to a downstream service, or for upstream services to talk to me. So functionality, we covered
that. On the interoperability side, I'm calling various
downstream services, usually as part of any feature.
Am I calling those systems appropriately? Am I breaking any API
patterns? Are the latencies that I experience something that's acceptable to my end users, or my target end users? And ultimately, on the confidence side of the house, which is, again, why testing exists: this change that I'm introducing right now, will it cause future production deployments to be more error prone? Will it cause more
alerts to go to our on call? Has the feature really been implemented
in the simplest and the most resilient way possible?
So normally, when we talk about this, we have source code right on the left hand side. I've kind of tried to represent this as a monorepo structure, but all of these primitives apply just as cleanly if a company has multiple smaller repos, like micro repos where every microservice comes out of its own repo. Doesn't matter. You can see various libraries, modules, et cetera, all of it living together. There's some software that runs in the middle which will
ultimately cause some form of artifacts
to get built. For example, a Docker container, an OCI image of some sort, that then gets sent to our production workload management system or software, which goes and runs this latest version of a certain artifact, and then all of our systems go and talk to each other. So how
does this software actually end up in production? Normally I'll
just use a couple of Kubernetes examples, but this is applicable to
any sort of runtime.
Kubernetes has these deployment YAMLs or Helm charts, et cetera. Ultimately I push a new branch somewhere, or if it's a new change in the main branch, a new Docker image might get built. This image gets pushed into a container registry somewhere, a field is updated in my Helm chart, which is normally in the source code, and this field is probably going to say this is the latest version of this image. Then the Helm chart gets applied against my production Kubernetes cluster. At that point in time the Kubernetes cluster knows, okay, I need to pull this image down from the appropriate container registry. It pulls that down, gets it deployed, stuff works.
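As a rough illustration of that flow, the field being updated might live in a Helm values file along these lines; the registry, image name, and tag are placeholders rather than a prescribed layout.

```yaml
# values.yaml -- hypothetical Helm values kept in source control next to the chart
image:
  repository: registry.example.com/my-service   # placeholder registry and image name
  tag: "1.4.2"                                   # CI bumps this field whenever a new image is pushed

# templates/deployment.yaml (excerpt) then references those values:
#   image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

Applying the chart against the production cluster, for example with helm upgrade --install my-service ./chart, is what tells Kubernetes to pull that tag from the registry and roll it out.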
Now we covered the prod deployment process. What is preprod? Preprod normally involves engineers, as I was saying, pushing their branches somewhere and having the CD system go and deploy that change into a shared-tenancy pre-production environment of some sort. These are great, for the most part, till... I'll give an example. I am working on a front end change. Bob is working on an API gateway update. Sally is going and running a database migration. When I push my front end change into the pre-prod environment, I suddenly see an issue related to a database migration pop up in the error logs. That didn't happen because of me. A GitOps
based approach, as we can see on the right hand side over here, can let
us get into a world where ephemeral preproduction environments are just
coming up, where each of our changes gets tested in isolation.
And finally, when all of the changes land up in
the main branch, similar tests can also be run to make
sure the current state of the main branch is pretty healthy.
It's green. But during our non-production dev processes, being forced to share a pre-production environment with all of our colleagues again leads to a lot of developer friction.
Now, images. Normally in CI we don't regularly go and spin up our own version of a full-tenancy cluster of some sort where proper end to end testing is happening. There can be some cases where images get built, and this is just a little spec on how something like that might happen, but ultimately it just boils down to: check out the latest source code, make sure all of your relevant credentials are set up for whatever container registry you're using, build the image, push it there, and then finally call whatever your workload management system is. In this case it's a Kubernetes example. Call it with the latest version of this image that's in this registry, and Kubernetes is instructed to please go and apply these changes.
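A minimal sketch of those steps, written here as a GitHub Actions-style workflow, could look like the following. The registry, image name, and deployment name are placeholders, and cluster credentials and kubeconfig setup are omitted for brevity.

```yaml
# .github/workflows/build-and-deploy.yaml -- illustrative only; names and secrets are placeholders
name: build-and-deploy
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4                 # check out the latest source code
      - name: Log in to the container registry
        run: echo "${{ secrets.REGISTRY_TOKEN }}" | docker login registry.example.com -u ci --password-stdin
      - name: Build and push the image
        run: |
          docker build -t registry.example.com/my-service:${{ github.sha }} .
          docker push registry.example.com/my-service:${{ github.sha }}
      - name: Ask the cluster to run the new image
        run: kubectl set image deployment/my-service my-service=registry.example.com/my-service:${{ github.sha }}
```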
And in this world, in the CI/CD process that I just talked through, normally in our tests we might have configuration to set up some form of environment wherein tests are run, and every engineer then has to verify that the test failures are not environment related. If there are test failures, of course, they need to make sure they're only related to the changes that they have introduced, because for environment-related issues, in this world, the DevOps team is the one that's normally responsible for making sure the environment is left in a pristine state. In the GitOps world that we talk about, which is what you can see on the right hand side, the test suite gets instantiated and an ephemeral environment comes up, all backed by configuration that lives right alongside my source code. So if I didn't change any of my environment config in source, the ephemeral environment that comes up, that's a golden path.
In that environment, we know for sure that the only
changes that are going in at that point in time are the changes that I've
implemented for my feature. So when I run the test now,
all failures should only be happening because of my changes
and nothing related to the underlying infra. This is where DevOps teams can essentially add further superpowers to their capabilities: just by having configuration wherein whatever golden path they want to attain is codified and left in source code, every engineer just goes and spins up their own version of it. In the code review process, again, it's very similar.
Oftentimes engineers, whoever the reviewer is, need to make sure that for the changes we are reviewing, nothing is failing because of underlying infra issues, and nothing is succeeding because of potential underlying infra issues either. So as a result, preview environments, et cetera, everything can be very ephemeral. They just come up, I make my changes, and when changes get merged into main, all of the infrastructure that was spun up just gets deleted automatically.
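One hedged way to wire up that lifecycle is a pull-request-triggered workflow along these lines: deploy the branch into its own namespace when the PR opens or updates, and delete the whole namespace when the PR merges or closes. The namespace scheme, chart path, and image tag are illustrative, and cluster authentication is again left out.

```yaml
# .github/workflows/preview-env.yaml -- illustrative sketch of an ephemeral preview environment
name: preview-environment
on:
  pull_request:
    types: [opened, synchronize, closed]
jobs:
  deploy-preview:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy this branch into its own isolated namespace
        run: |
          kubectl create namespace pr-${{ github.event.number }} --dry-run=client -o yaml | kubectl apply -f -
          helm upgrade --install my-service ./chart \
            --namespace pr-${{ github.event.number }} \
            --set image.tag=${{ github.sha }}
  teardown-preview:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      - name: Delete the preview environment once the change merges or the PR closes
        run: kubectl delete namespace pr-${{ github.event.number }}
```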
So Dev and CI stages, they will normally
resort to using Docker for many of these functionalities.
You can see over here the "r" has gotten cut off due to my screenshot, I apologize for that, but ultimately you can see these are all just Docker Compose files. If a developer, for example, needs a MySQL database, we'll normally not go to AWS, for example, and spin up an RDS database just for that. We'll just use Docker to do our basic feature functionality testing, and this usually suffices for dev. But as we get closer to CI and
the other stages of the SDLC, that's where having more
production symmetric infrastructure might be more
helpful in weeding out bugs and getting a better sense of latencies,
et cetera, et cetera, as a user would experience when
interacting with a prod environment.
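To make the dev-stage side of that contrast concrete, a Docker Compose file along these lines is usually all it takes for basic feature testing; the image version, credentials, and database name here are illustrative only.

```yaml
# docker-compose.yaml -- illustrative local dev dependency; values are placeholders
services:
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ROOT_PASSWORD: dev-only-password   # fine for a local dev loop, never for production-ish stages
      MYSQL_DATABASE: app_dev
    ports:
      - "3306:3306"                            # exposed locally so the service under development can connect
```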
So now, why is all of this automation important?
How exactly does it improve our developer experience? So, on the left
hand side over here, let's take a normal dev loop.
I write code, I build, compile,
I run it, inspect to see if everything works fine,
make a commit, move on. Let's say as an engineer,
I'm working 6 hours a day. That's about, what, 360 minutes?
This entire loop takes about five minutes to run. So out of 360,
I can do this about 72 times in a day; so, 360 over five. On the right hand side, as soon as we go into this microservice, containerized world,
we get to this. Like, we do our code, we do our builds,
and then there's the container build, et cetera. All of the container ergonomics come in, and that's pretty time consuming sometimes.
And ultimately, you can see that five minute loop for getting the same
task done is now taking nine minutes. And because we
were working 360 minutes a day, which was our assumption,
now, 360 over nine is about 40 iterations. So on
the left hand side, I could do 72 iterations. On the right,
I can only do 40, which is actually a 45% degradation,
which can take a significant amount of time away from the inner loop
of the development process.
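Spelling out that arithmetic, with the five- and nine-minute loop times assumed above:

```latex
\frac{360}{5} = 72 \;\text{iterations/day}, \qquad
\frac{360}{9} = 40 \;\text{iterations/day}, \qquad
\frac{72 - 40}{72} \approx 44\% \;\text{fewer iterations (the roughly 45\% degradation above)}
```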
So this dev workflow that we had talked about in the past, like the one
you see on the left, is the traditional dev workflow for all of us.
Pull latest code, set up the local env, wait for the IDE to index, write code, do all of that testing, et cetera, et cetera. It's a long loop. On the right hand side, taking this GitOps based model, we can actually, when we want to do dev, just get an environment and immediately hook our IDE into it. Everything has already been set up for us. I don't have to wait around for anything. I just
do all of my testing. I know that this is a completely dedicated environment
wherein Bob and Sally's changes, et cetera, none of it's going to influence me.
All of the issues that I discover have to be fundamentally related to
the code I just wrote. So now when I'm sending
my code for a code review, I have much more confidence
that things are going to work to spec. And in the CI/CD to pre-production workflows, if there are some changes on the main branch that need to get tested, some of that stuff is getting pushed later
in the pipeline. It's not going and impacting every single engineer
every time they are trying to write code again.
So, for dev purposes, right: normally, whenever we are developing a feature, service dependencies are normally pretty isolated. Sure, we have our IDE running somewhere. But wouldn't it be great, like on
the right hand side, you can see the red circle calls the green
circle, which calls the pink looking circle over there.
If we could get a workflow replicated wherein that red
was actually calling green, which is currently running inside my
IDE with my debugger connected to it, that would let
me have much greater confidence of the software I was building, because I know
what the longer request chains are going to look like, and I know exactly how
my changes are going to function with respect to them.
So there are various ways of looking at it, right? The green one
I was just showing you in the previous slide is an example of when my
service or my feature exists in the middle of a
long microservice call chain.
Similarly, if I'm testing something at the start of a call chain or at
the end of a call chain, again, having access to a production-like environment, wherein all of the network calls, et cetera, are seamlessly handled for me underneath, and wherein the only thing that is under test is the change that I'm working on, that would help engineers get a lot more confidence a lot faster in the SDLC, right, while they are in the inner loop, instead of having to discover issues after the change hits the outer loop and then go back into the inner loop. Because that context switching, right? That's the most important
and expensive part for us software engineers.
So ultimately, to wrap up with takeaways: as engineers, we all know that issues will always be easier to fix when they're caught way before production. The fastest would be when we are writing the code in the first place, the first time we are writing that line of code. One way in which this can be achieved is by using a GitOps-backed approach to having production-symmetric, or production-like, environments for every stage of the SDLC as I'm writing my code. And why this is important: when different stages of the SDLC are configured differently, it all adds different bits of developer friction and ruins the dev ergonomics. So at its core, taking a GitOps based approach would also remove the drift between these various stages of the SDLC. So,
yeah, thank you. Thanks for listening to my talk. I work at this company called DevZero, where we are trying to operationalize various things that we discussed in this talk today. If you're interested in checking out how any of these things can be applied to your day to day engineering workflows, check us out at DevZero.
Thank you.