Transcript
You might be wondering, who is this guy? So, my name is Luka Kladaric.
I am the chaos manager at Sekura Collective.
We deal with chaos. We don't do greenfield development, we don't do features.
We don't put buttons on websites. We won't build your mobile app.
If you are drowning in tech debt, performance, or security issues, we'll put you on
the right track and then hand off either to an in-house team or to
a different agency for long-term maintenance.
I've been a recovering web developer for over ten years. These days
I mostly work as an architecture, infrastructure and security
consultant. I'm also a former startup
founder and a remote work evangelist. I've been
working remotely for over ten years, so catch me here
if you want to talk about remote engineering teams, remote hiring,
et cetera. So what is a hostile environment? In this context, I consider a hostile environment to be any software development shop where engineering is seen and treated as the workforce that implements ideas coming from outside, with little to no autonomy. Usually product or a similar team owns 100% of engineering time, and you have to claw back time to work on core stuff. Which brings us neatly to tech debt. I'm sure everyone knows what it is, but just to get a definition out there, the one I like is: it's the implied cost of an easy solution over a slower but better approach, accumulated over time.
So how about a quick example? Tech debt is an API that returns a list of results with no pagination. You start with 20 results and go, yeah, this will never grow beyond 100 elements. And then three years later you have thousands of elements in the response, the responses are ten megabytes each, and so on.
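To make that concrete, here's a toy sketch of that kind of endpoint. The database helper, table, and column names are made up for illustration, not taken from the talk.

```python
# Hypothetical endpoint helpers; `db.fetch_all` and the `items` table are invented.
def list_items_unbounded(db):
    # The "easy" version: fine at 20 rows, a ten-megabyte response at 20,000.
    return db.fetch_all("SELECT * FROM items")

def list_items_paginated(db, offset=0, limit=100):
    # The slightly slower-to-write version that never blows up.
    rows = db.fetch_all(
        "SELECT * FROM items ORDER BY id LIMIT ? OFFSET ?", (limit, offset)
    )
    next_offset = offset + limit if len(rows) == limit else None
    return {"items": rows, "next_offset": next_offset}
```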
It's fragile code that everything runs through, things you're afraid to touch for fear of breaking things or introducing unexpected behaviors.
It's parts of the code base that nobody wants to touch. If you get a
ticket saying something's wrong with user messaging and everyone backs away from their
desk and starts staring at the floor, that's a clear sign it needs focused effort. It's entire systems
that have become too complex to change or deprecate, and you're stuck with
them forever because there's no reasonable amount of effort to fix or replace
them. It's also broken tools and processes, but even
worse, it's lack of confidence in the build and deploy process.
Breakages should not roll out, and a deploy of a good build should always succeed. It's what you wish you could
change, but can't afford to for various reasons. So where
does tech debt even come from? It's insufficient upfront
definition, tight coupling of components, lack of attention to
the foundations and evolution over time.
It's requirements that are still being worked on during development.
Sometimes they don't even exist. Sometimes you get the requirements when you're halfway
done or 90% done.
So do any of these sound familiar?
"We'll get back to that later." No, you won't.
"Why does X have to be clean when Y isn't?" You see a pull request with a hard-coded credential or an API key or something, and you say, no credentials in the code base, please. And you get pushback: but look at this cron job from four years ago with the hard-coded SQL connection string and password. Having to answer these questions every couple of weeks is draining. That's tech debt.
People pushing back on good solutions is tech debt.
"We're a tiny startup. We can't afford to have perfect code." A: nobody's talking about perfect code. B: no such thing exists. "We're still trying to prove product-market fit." Once you do, you will need a product that actually works and the ability to run at 100%. You don't want your golden moment, when the business side finally works, to be the complete technical breakdown of your stuff.
"When we make some money, things will be different." No, they won't. I see some of you on the left side here have lived through this or are currently living through it. I love it. It's not just a matter of bad code, bad processes, bad tools, or whatever. It's people learning the bad behavior and then teaching the new hires the same as your team grows. If you have five people with a bad process, just a bad approach to things, and you bring in ten fresh people, there's a chance this works out. But if your company grows to 50 people and you're still stuck in this state, you're going to be bringing people in one by one or five by five. The folks steeped in the bad ways will outnumber the good folks by an order of magnitude, and the good folks will just be trained in the bad ways. They will have no chance, as the new person, of swaying you towards the light.
So what's the actual harm? Unaddressed tech
debt breeds more tech debt. Once some
amount of it is allowed to survive and thrive, people are more likely to
contribute lower quality solutions and be more offended when asked to improve
them. So over time, productivity drops,
deadlines slip, and reasons are manifold. You have some objective reasons,
like it's difficult to achieve results with confidence.
There's a high cognitive load of carefully treading through
the existing jungle to make changes.
Quality takes a nosedive and you start to see a clear separation between the new, clean stuff that was just shipped and the older code, anything six or more months old, which is instantly terrible. You will have the
web team ship a brand new application like the
beacon of light, and then six months later say, that's no longer how we do
things and that should not be looked at as a
guide. And it's like, what happened in the six months? How did it become
the wrong way of doing things? And you
end up with no clear root cause for issues or outages.
There's nothing that you can point to and say, this is what we should have
done six months ago to avoid this situation. It's just
miserable. You have tanking morale among tech staff, because they are demotivated by working on things that are horrible, and people start avoiding any improvement that isn't strictly necessary. They abandon the Boy Scout rule: leave things better than you found them. It's a death-by-a-thousand-cuts situation, one big pile of sadness for people who are supposed to be championing new and exciting work.
And to this you might say developers are just spoiled,
right? False. It's a natural human
reaction that when you see a nice street with no garbage, no gum stuck to
the pavement, you're not going to spit your gum there. That's just how humans act.
You bring anyone from anywhere to anywhere nice, and that's how they're going to act.
In a bad neighborhood, you chuck your gum or your cigarette butt on the floor because it's already littered, it's already terrible.
You shouldn't allow your code base or your systems to become the bad neighborhood,
because if you're surrounded by bad neighbors, why should you be any different?
And it becomes just a vicious, self perpetuating cycle.
So, just a quick case study. I'd like to give you an example of unchecked tech debt over a few years
in just a regular startup looking for the right product to sell
to the right people. And nobody here is particularly clueless,
nobody is particularly evil. Just the way things are done.
And depending on your mixture of founders and initial employees, some processes are
set better, some are worse. But eventually you end up with something like this.
I get a call one day and they say, we'd love for you to come on as a senior engineer. We've seen your talks, we like your stuff.
If you could rotate through the teams and find the place where you can contribute
the most, we have a bunch of really great,
enthusiastic engineers, but could use some senior experience to guide
that and just raise the skill level as a whole in the company.
So I go, all right, I join. And within days my alarms start
going off. Very nice people all around, clearly talented engineers, hired from the best, from the top schools.
But the tech stack and tooling around it was really weird.
I can't wrap my head around it. There's a massive
monolithic git repository. So you go like, where's the code? It's in
the repo. Okay, but like, which repo? The repo, all right,
it's a ten gigabyte repository hosted
in the office. It's very fast for office folks. It's like local
network, but there are remote people in the company. Some dude
on an island on a shitty DSL halfway around the world is going to have
a very bad day if they buy a new laptop, if they need to reclone
or something. There's no concept of stable.
Touching anything triggers a rollout of everything because there's no
way for the build server to know what's affected by any change.
So the safest thing to do is just roll out the universe.
One commit takes an hour and a half to deploy.
Four commits is your deploy pipeline for the workday.
Rollbacks take just as long, because there are no rollbacks, of course; you just commit a fix and it goes through the same queue. So by 9:00 a.m., or let's be honest, 10:00 a.m., you have four commits. That's your pipeline for the day.
And at noon something breaks and you commit a fix immediately. It's not
going to go out until the end of the day unless you go into the
build server and manually kill the, I don't know, many dozen jobs that are in the queue, one for each little system, so that your fix gets deployed. And then you don't know what else you're supposed to deploy, because you just killed the deploy jobs for the entire universe. You don't know which changes should go up.
There's a handcrafted build server. It's a Jenkins box, of course,
hosted in the office. There's no record of how it's provisioned or configured; if something were to happen to it, the way you build software is just lost. You've forgotten how you're building software. And each job on it is subtly different, even for the same tech. You have Android source code that you build three instances out of, white-label stuff, but each of them builds in a different way. So you'll have a commit that breaks one of the builds even though it didn't touch anything related to that specific build, stuff like that.
There are no local dev environments, so everyone works directly on production systems. This is a great way to ensure people don't experiment, because they'll get into trouble even when working on legitimate stuff. It's going to beat experimentation and improvement out of them. People have to use the VPN for everything, even nontechnical stuff, I'm talking about support and product, and a VPN failure is like a long coffee break for the entire company. And just-written code is hitting the master database.
There's no database schema versioning. Changes are made directly on the master database with honor-system accounting, so half the changes just don't get recorded because people forget, and there's no way to tell what the database looked like a month ago, no way to have a test environment that is consistently the same as production. Half the servers are not deployable from scratch. This almost guarantees that servers which should be identical, like three instances of the Python app, are different, and you don't know what the difference is because you have no way to enforce them being the same. Or even worse, their deployability is unknown or hasn't been tested, so assume it doesn't work.
The code review tool is some self-hosted abandonware: bug-ridden, unsupported, very limited. The tool people use to develop software is constantly enforcing limits on them.
Outages become a daily occurrence. The list of
causes is too long to mention. You have the main production API server
falling down because a PM, a product manager, committed an iOS translation string change. Of course they did it on a Friday at 7:00 p.m., after hours, on the other side of the planet, so it's 2:00 a.m. for me where I am. And individual outages are just not worthy of a postmortem, because there's no reasonable expectation of uptime and everyone is focused on shipping features. And you get that "let's just get this one done and then we can talk about refactoring" every single time. You can't refactor eight years of bad decisions.
You start approaching the point of rewrites, which are almost always a bad idea
to begin with. And every time you say just this once, it's another step
in the wrong direction. So how do you even begin to fix this? This is
just a terrible situation. This is what I walked into. This isn't like six different
places. This is one place and I've left out the
more esoteric stuff. So how do you even start to fix this?
I like to call this state not tech debt, but tech bankruptcy.
It's a point where people don't even know how to move forward, where you
get a request to do something and people go like, I don't even know how
to do that. So every task is big
because it takes hours to get into the context of how careful you
need to be. Now,
at the time, the infrastructure team was staffed with rebels,
and they were happy to work in the shadows with the blessing of
a small part of leadership. So to stop further deterioration
and incrementally improve things, I kind of wiggled my way into the infrastructure team as a developer, and we started working on things. It took over a
year and a half to get to the point where I didn't think it was
just completely terrible. We wrote it all down.
Every terrible thing became a ticket. We had a hidden project
in our task tracking system called Monsters under the bed.
And whenever you have a few minutes, if you have two meetings
with a half hour in between, you open the monsters and you think about
one of them, you find a novel way to kill it. The team
worked tirelessly to unblock software developers and empower them to
ship quality software. And most of the work was done in the shadows,
with double accounting for time spent.
How? The build server was rebuilt from scratch with Ansible, in the cloud, so it can easily be scaled up or migrated or whatever. We now have a recipe for the build server, which knows how our software is built and deployed. Build and deploy jobs are defined in code, no editing via the web UI whatsoever. And because they're defined in code, you have inheritance. So Android is just, here's how we build Android, and if there are differences between the builds, you extend that job and just define the differences.
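Just to illustrate the inheritance idea (the real setup used the build server's own job-definition format; the class and flavor names below are invented), it looks roughly like this:

```python
# Toy sketch of "build jobs as code, with inheritance"; all names are hypothetical.
class AndroidBuildJob:
    """Base job: how any Android app in the company gets built."""
    gradle_task = "assembleRelease"
    keystore = "default.keystore"

    def build_command(self) -> str:
        return f"./gradlew {self.gradle_task}"

class WhiteLabelFooBuildJob(AndroidBuildJob):
    """One white-label flavor: only the differences are spelled out."""
    gradle_task = "assembleFooRelease"
    keystore = "foo.keystore"
```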
The monolithic repo was slain. We split it up into 40 smaller ones, and even that first iteration was later split again into even smaller repos. While preparing this, we found three earlier proposals for killing that repo, and all of them had an all-or-nothing approach: either pause development for a week to revamp the repo, or cut your losses, lose all history, and start fresh, things like that. We built an incremental approach instead: split out a tiny chunk, pause development for an hour, for just one team, and move them to a new repo with their history, with everything.
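As a rough sketch of what splitting out one chunk with its history can look like (the paths and repo names here are invented, and the actual tooling may well have been different), one way is git subtree driven from a small script:

```python
# Hypothetical sketch: carve one team's directory out of the monorepo,
# history included, and push it to its own repository.
import subprocess

MONOREPO = "/srv/monorepo"                 # made-up path to the big repo
TEAM_DIR = "services/payments"             # made-up subdirectory to extract
NEW_REPO = "git@example.com:payments.git"  # made-up new home

def git(*args, cwd=MONOREPO):
    subprocess.run(["git", *args], cwd=cwd, check=True)

# 1. Put the subdirectory's history on its own branch.
git("subtree", "split", "--prefix", TEAM_DIR, "-b", "payments-split")

# 2. Push that branch to the new, much smaller repository.
git("push", NEW_REPO, "payments-split:master")
```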
Infrastructure went first, showed the path towards the light to the others.
We set up a system where changes trigger a build deploy only
on the affected project. And commit to live
is measured in seconds, not hours. Some teams opted out.
Some teams said no, we like the old approach. They were allowed to stay in
the monorepo. A few months later they joined the party.
They saw what the other teams were doing and they were like, we want that.
All servers were rebuilt and redeployed with Ansible. This used to be some 80 machines across 20 different roles. It was all done under the guise of upgrading the fleet to Ubuntu 16. Nobody understood what that was, nobody asked how long it would take, but whenever someone asked about a server whose name had changed, we would just say, oh, it's the new Ubuntu 16 box. What actually happened is that in the background we wrote fresh Ansible to deploy a server that did roughly that thing, then iterated on it until it actually could do that thing, and then we killed the old hand-woven nonsense and replaced it with the Ansible-built box.
We migrated to modern code review software, moving away from self-hosted git hosting and self-hosted code review to GitHub. The VPN is no longer needed for day-to-day work, not just dev work. The only thing you connect to the VPN for now is the master database, and nobody has write access anyway, even when you're on the VPN, so it's not really useful for dev work, just for fishing out weird things in the data. Local dev environments: no more reviewing stuff that doesn't even build because people are afraid to touch it, and no more running untested code against production. And we versioned database schema changes through code-reviewed SQL scripts. So there's a code review process for the SQL scripts that actually affect the master database, and with that you can have a test database, because you know which changes have been applied and you can keep the two in sync.
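A minimal sketch of that idea, not the actual tooling from the talk (the file layout, the driver, and the tracking table are assumptions): apply the reviewed SQL files in order and record each one, so production, test, and local databases can all be brought to the same schema.

```python
# Hypothetical migration runner: applies numbered, code-reviewed SQL scripts
# exactly once and records them in a tracking table.
import pathlib
import sqlite3  # stand-in driver for the example

def migrate(db_path: str = "app.db", migrations_dir: str = "migrations") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for script in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
        if script.name in applied:
            continue  # already recorded on this database, skip it
        conn.executescript(script.read_text())  # run the reviewed SQL
        conn.execute("INSERT INTO schema_migrations VALUES (?)", (script.name,))
        conn.commit()

if __name__ == "__main__":
    migrate()
```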
The moral of the story is this:
don't wait for permission to do your job right. It's always easier
to beg for forgiveness anyway. So if you see something broken,
fix it. If you don't have time to fix it, write it
down, but do come back to it when you can steal a minute.
And even if it takes months to make progress,
it's worth doing. So the team here was well aware
of how broken things were, but they thought that that
was the best they could do. And it wasn't. It wasn't.
But if we had pushed for it to be a single massive project, and said, okay, let's do things right from now on, it's going to take a year to have any measurable impact, it's going to take this many engineers full time for however long, it never would have happened. It would have gotten shot down at the highest level instantly, because we can't afford that.
Instead, we turned a small team into
a red team and just went and did it. Job well done.
Take the rest of the week off. Right?
Except that's not how things
should be. It's horrible. It's horrible. Let's admit that it's horrible.
You have a team of people directly subverting the established processes because
they are failing them. And you have managers going, that's fine,
you go do that. Don't tell anyone, and I'll protect
you. But pretend I didn't know. It's terrible.
So how do we do better?
Tech debt work is very difficult to sell. It's an unmeasurable amount of pain that increases in unmeasurable ways, and if you put in some amount of effort, you get unmeasurable gains. It's terrible, right? How do you convince
someone to go for that when they have clearly
measurable impact of feature work? And the name itself, tech debt, implies that we have control over it. It implies that we can choose to pay it down when it suits us. But it's not like
paying off your credit card. With the credit card, you get
a number every month, and you go, okay, this month I'm going
to party a lot because it's my birthday or whatever, so I'm going to pay
off 10% of this card, and next month I'm going to stay in my house all month and pay down 20%, and pretty soon I'll be out of debt. With tech debt, there's no number.
It's just a pain index with no upper bound, and it can
double by the next time you check your balance. It's incredibly
difficult to schedule work to address tech debt because nobody explicitly asked
for it. Nobody asked for you to improve the code base.
They want something visible, something measurable to be
achieved. And it directly takes time away from a measurable goal, which is
shipping features. There's a quote that I love from
a cleaning equipment manufacturer: if you don't schedule time for maintenance, your equipment will schedule it for you, and things will break at the most inopportune time possible. So I think it's time for
a new approach. I recently came across an article that
changed the way I think about this,
and it's called Sprints, Marathons, and Root Canals, by Gojko Adzic. It suggests that software development is neither a sprint
nor a marathon, which is a standard comparison we often hear as they
both have clearly defined end states. You finish a sprint, you finish a marathon.
Software development is never done. You just stop working on it because of a
higher-priority project, or the company shuts down, or whatever.
But it's never done. It's never achieved
an end state. You shouldn't need to put
basic hygiene like showering or brushing your teeth on
your calendar. It should just happen. And you can skip
one, but you can't skip a dozen. Having to schedule it means something went horribly wrong, like having to go in for a root canal
instead of just brushing your teeth every morning.
So, a new name I've fallen in love with: sustainability work. It's not paying down tech debt, it's making software development sustainable. It means you can keep delivering a healthy amount of software regularly, instead of pushing for tech debt sprints or tech debt days, which, every time you do them, feel like you're taking something from the other side. It needs to become a first-tier work item. And it's okay to skip it here and there, you can delay it for a little bit, but the more you do, the more painful the lesson will be.
Agree on a regular budget for sustainability work, either for every team or for every engineer, some percentage of time, and tweak it over time. It's a balance, and there's no magic number; it depends on where you are in the lifespan of the company and the software and so on. But work on sustainability stuff all the time. And there's no need to discuss it with people outside the engineering team. Engineers will know which things they keep stumbling over; they will know what their pain points are. So just allocate a budget for sustainability work and let them figure out what they're going to do with that time. Later you can discuss what the effects were and whether they need more time, or whether they can give some back.
And it's fine if a deadline is coming up and you say, let's pause sustainability work for a week or two, for a sprint or two. That's fine. But it needs to be there first, and then you can give it back to product when there's crunch time. And it helps with morale: there's nothing worse than being told you're not allowed to address the thing that's making your life miserable.
So the term tech debt sends the wrong message. Call it sustainability work. Make sure it's scheduled as part of the regular dev process, not something that needs to push feature work out to end up on the schedule. That's it for me.