Transcript
Hi, and thank you for attending my talk on the road to zero lint failures.
I'm Chris Ng, and I'm a senior staff software engineer at LinkedIn.
So let's jump right into it. For today's talk, we'll go over why we use lint rules, code quality at scale and the challenges around that, how to fix it, or at least the path that we took to zero lint failures, and finally the key results and next steps afterwards.
So why lint rules? Right? Like, why do we care about lint
rules? So back in 2017, we always had a saying that if it's not in the docs, it's not a real thing. You can't comment on people's pull requests about certain best practices or rules of thumb; if it's not in the docs, then you can't flag it as something that the other person must fix.
This kind of became a thing because so many different people
had different opinions on code quality and what
best practices are, and we really needed just some standardization.
So lint rules are perfect tools for automating code analysis, catching and even preventing errors, and keeping your code consistent.
The common tool you're probably familiar with is ESLint. There's also Stylelint for your CSS or Sass files, and there are a bunch of other tools in the ecosystem for more niche use cases.
So lint rules are really amazing tools for us because they're automated, right? It can start even at your IDE. So when you're typing up here, when you're typing your my-component file, you can see that ESLint is already catching an unused variable that you can remove from your code before you check it into your repository. We can run some pre-commit scripts using tools like Husky to run ESLint on your files, either all of your files or just the changed files, to flag this error again. And then you have another layer at your source control of choice. Here it's a screenshot of GitHub, where we run a GitHub Action testing your lint rules on the files that are being changed. As you can see here, there are three chances of catching it: not everyone will have their IDE configured, and not everyone will have their local pre-commit scripts running. So you have different layers, but the idea is to catch it as early as possible, right?
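To make that middle layer concrete, here's a minimal sketch of a pre-commit setup. The talk names Husky; pairing it with lint-staged to lint only the changed files is my assumption, not something stated in the talk:

```javascript
// lint-staged.config.js -- a hedged sketch, not LinkedIn's actual config.
// Husky's .husky/pre-commit hook would invoke `npx lint-staged`, which runs
// the command below against only the files staged for this commit.
module.exports = {
  // Lint (and autofix where possible) staged JavaScript files; a non-zero
  // exit code from ESLint blocks the commit.
  '*.js': 'eslint --fix --max-warnings=0',
};
```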
And then we also use Prettier, so there are no more arguments about formatting, no more arguments about tabs versus spaces or anything like that. Prettier really removed all the bikeshedding for us and saved a lot of time for our engineers.
So lint rules, they're more than just for style issues, right? Like, Prettier is what we use for style issues. Lint rules are what we use for making sure our code is consistent. It even prevents bugs in a lot of cases; I'll show you an example in a little bit. And more importantly, if you're doing any migrations for your code, and anyone who has code running in production for years and years is always going to need migrations, lint rules are amazing tools for not only identifying which areas you need to migrate, but also making sure that for the most part, everyone's code is consistent. So it's easier to migrate and do either a find-and-replace or a codemod to migrate yourself from a v1 to a v2, or even from one framework to a different framework. But yeah, here's a very quick example of lint rules in action, where it's catching something that might not be super obvious at first. So you have this paging object,
and let's say you're updating this paging object that you get from an API, but you want to be a little bit safe. And if your data doesn't come back with a paging object, then you do this optional chaining, right? But if it comes back, then you do some arithmetic, adding the start and the count, which will be the new start for this paging object. Seems kind of good, right? ESLint will actually catch this with the lint rule no-unsafe-optional-chaining, where it would say that this result could potentially become a non-number. Why is that, you would ask? Unfortunately, in this case, if your argument is an empty object, you would come back with undefined plus ten, which equals NaN. And then that just blows up your code. And edge cases like this are a very easy mistake to make if you don't have TypeScript in your code.
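As a rough reconstruction of the slide's example (the exact code isn't in the transcript, so the shape of the response object here is my assumption):

```javascript
// A sketch of the pitfall: optional chaining inside arithmetic.
function nextStart(response, count = 10) {
  // If response.paging is missing, the optional chain yields undefined,
  // and undefined + 10 evaluates to NaN, silently corrupting the new start.
  return response?.paging?.start + count;
}

nextStart({ paging: { start: 20 } }); // 30
nextStart({}); // NaN

// With the option { disallowArithmeticOperators: true }, the ESLint rule
// no-unsafe-optional-chaining flags this arithmetic at lint time.
```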
Edge cases like this will always happen in a large application, where it's just not easy or even possible to ask everyone to check for a condition, or to rely on code reviews where there might be new hires or more junior engineers reviewing each other's code. And you just can't have that much control, right? Like, the ship must keep sailing, and there's no way you can single-handedly review everyone's code, or you also would go on vacation and not be around. So yeah, here's a very simple example of lint rules catching and preventing a production issue. So really, what we're trying to do is expand the well-lit path so you don't need that constant attention on every single person. Just make it easy to do the right thing, and make sure that the right thing is easy to do and gets followed. So what's the problem? Right, let's just add all the lint rules, keep adding all the rules, and everyone will have perfect code and achieve total nirvana and we can all go home, right?
Unfortunately, with code quality at scale, we faced a lot of different problems with this approach. But first, what does at scale mean for our application? It meant an application that had been in production for eight years. We had over 24k files, and that includes application and test files. I believe there are over 100k files in the repository overall, a lot of internationalization files for sure, but it's kind of all over the place. We also had maybe over 100k commits throughout the lifetime of the application. And year over year, it was growing in lines of code by 21%. There were over 67 PRs per day written by humans, not automated by any kind of dependency upgrades or something like that. And most interesting for me is that we had over 800 unique contributors to the repository with at least five commits. So that's a non-trivial amount; that's not someone changing some sort of string or anything like that. Five commits is kind of the minimum of what you're doing, at the very least. And that's 800 people writing JavaScript, right? That's quite a big number. I don't think that's the number working on the product every day, but throughout the life of the app that's quite big, and you can't enforce standards on that many people with a small team, right?
So really, the problem with code quality at scale is that we needed lint rules to enforce all of these standards, but we were just increasing the number of failures throughout the application that were never getting fixed. I think at our peak we had over 7,000 errors in the application, just hanging out there as various failures, and no one was really fixing them, right? It was just ever increasing.
So ideally, we follow the campsite analogy, where you leave the campsite better than you found it. If you write a PR, you see a file, you see a lint failure, you fix it as part of your PR, and then over time we get to a place where the number of lint failures goes down. What if your campsite already looks like this when you come to it, though? Are you still going to clean up all of the lint failures, or just maybe a couple of them?
So this was kind of our first foray into limiting these lint failures. We set up a couple of rules. Some of them worked really great, and some of them did not work and had to be rolled back. First of all is blocking commits if there's a lint failure in the file. This was really important for us because it really stopped the bleeding. Granted, it was not a popular decision, and you can override it for sure, but it emails everyone. Blocking commits is the best way to do this.
The second thing we tried to do was setting limits. So maybe you can have five instances of no-unused-vars. This kind of worked, but there was so much added work in the accounting and making sure the numbers were right, where really you just want to make sure the number goes down, right? We don't really care that there are five right now, and if it becomes six because of some sort of special case, that's fine, but really we just want to keep making the number go down, the number of failures, that is.
And then finally, the third way that we tried to solve this was a platform review system. You can think of this as a release owner model, where there's someone on your particular team reviewing your code, and then someone from a more experienced group doing some sort of standardization, so everyone is following some guidelines. The last two didn't work super well. They worked sometimes, but sometimes they didn't work so well for our use cases. But blocking commits definitely worked.
So the only problem with blocking commits is if you revisit this campsite analogy, but then add in the case where your manager, or maybe your manager's manager, or maybe even the CEO says, get this in today. What if you need to land something really fast? What we saw was that most people would just work around, and even code around, files with a lot of lint failures, and then ship it, done, by overriding or avoiding. Great, right? Let's get this to production to serve customers. That's what we're here for, right? To serve customers. Yeah, it's really great, I think, and I can't really fault them, but it's not really fixing our lint failures problem.
So to really stop the bleeding even more, we introduced two new things. One was a lint proposal process, and one was a task force to own the linting framework. The point of the task force is to have accountable engineers who will maintain this framework over time until we get to zero lint failures, zero warnings, et cetera. And the point of the lint proposal process was to make sure that when we add new rules, we don't add, like, 1,000 more errors, which was another case where we would suddenly get a bunch of new failures because a new rule was enabled and someone didn't bother to do even a find-and-replace or something like that, which really made it difficult to get to zero lint failures.
So the proposal process made sure that all existing failures must be resolved before enabling a rule, a lint rule, as an error. You can enable it as a warning, you can solicit feedback and things like that, but before it's an error, all existing failures must be resolved, unless you had a very good reason, usually something that we've seen time and time again take production down. At that point, let's push this rule out and then make sure production is stable.
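As a sketch of what that rollout looks like in an ESLint config (the rule name here is just the example from earlier in the talk; the actual LinkedIn config isn't shown):

```javascript
// .eslintrc.js -- a hedged sketch of the warn-then-error proposal process.
module.exports = {
  rules: {
    // Phase 1: enable as a warning to surface existing failures and gather
    // feedback without blocking anyone.
    'no-unsafe-optional-chaining': ['warn', { disallowArithmeticOperators: true }],

    // Phase 2 (after every existing failure is resolved): promote to error
    // so no new violations can land.
    // 'no-unsafe-optional-chaining': ['error', { disallowArithmeticOperators: true }],
  },
};
```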
So, yeah, this was one of the key turning points for us: all new rules must have existing failures resolved. This was not super popular either, because lint authors were questioning, like, what if there are 1,000 errors I need to solve? What if I don't know how to fix existing errors? What if I don't have the time to do this? What if I break something, right? What if I break production by fixing a lint failure, and there are just other existing failures in the files I'm cleaning up? But this revolves around the same theme, right? In order for us to get to zero lint failures, everyone must chip in, and all of these problems that lint authors were facing, the feature teams themselves were facing the same things, right? Like, they could be facing lint errors while they're trying to ship a different, unrelated feature and cause something production-breaking. What we've seen is that the lint author is a lot more familiar with what needs to be fixed, and so they're less likely to break something in production at the end of it. What we've really learned is that not all problems are software engineering problems. Like, just be a human, talk to people. You can get really far with just trying to talk to people face to face or through any chat app. Just explain why you're trying to do this, understand why it's difficult for the other person to do it, and help each other get through it together, instead of throwing a package across the wall and not bothering to communicate with the other person.
So, on to our road to zero lint failures. We had other initiatives to get it down to zero; everything so far was just to stop the bleeding, right? So really, it starts with incentives, right? Incentives, incentives, incentives. We really need to understand: why would someone want to clean up a lint failure in their codebase? Is it just because they need to, because their commit is blocked otherwise? Or is there any other kind of reason for them to clean up lint failures in their team's ownership?
So we had these goals: having accountable code owners, prioritizing lint failures, having a business case for getting prioritization, getting some sort of actionable insight for engineers so they know what to actually fix, and tooling to tie it all together.
So really, we provided a lot of technical support. If someone's trying to clean up a lint failure, we'll guide them step by step. The developer experience of the red squiggly lines, for example, sometimes breaks, and our team was always there to help fix those issues. We provided lots of shout-outs to engineers, and very specific shout-outs at that, identifying exactly what engineers fixed in their PRs, to give them visibility for all their hard work and adequate recognition for taking that extra time to clean up the lint failures.
So, tooling, right? We kept it really simple: identify the owners, identify the errors, keep it up to date, and that's it. Very easy. There are a lot of more sophisticated tools out there, which we ended up looking into migrating to right about now, but at the very start, make it easy for people to just do their job and clean up the failures.
So initially, a single owner per file was the thing we needed to nail down, right? For every single file, there's only one owner. It doesn't have to be one person, right? It could be one team, but there's a particular team who owns every single file. Otherwise, you're just playing whack-a-mole, going around teams figuring out who owns this file, if it's a dispersed ownership model.
We also created a very basic visualizer tool. As you can see here, we didn't even pick a font; we were just in a serif font here with an HTML table. But we broke it down by team, we broke it down by lint rule, and we also broke it down by team and lint rule. So if you're part of, say, the architecture team, and you want to find all the no-unused-vars failures, because you want to just delete all of that, you can drill down by that and understand how much time you're committing to resolving those lint failures. For our MVP, it updated nightly via a cron job that just ran ESLint. It ran on someone's VM, which was interesting during COVID. And we counted errors, warnings, and disables; in this presentation, we'll really just focus on errors.
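As a rough sketch of what that nightly job computes, using ESLint's public Node API (the actual implementation used Checkup, covered next; `ownerOf` is a hypothetical helper standing in for the single-owner-per-file mapping):

```javascript
// A minimal sketch of the nightly count, assuming ESLint's Node API.
const { ESLint } = require('eslint');

async function tallyByTeam(ownerOf) {
  const eslint = new ESLint();
  const results = await eslint.lintFiles(['app/**/*.js', 'tests/**/*.js']);
  const counts = {};
  for (const { filePath, errorCount, warningCount } of results) {
    const team = ownerOf(filePath); // hypothetical file -> owning-team lookup
    counts[team] ??= { errors: 0, warnings: 0 };
    counts[team].errors += errorCount;
    counts[team].warnings += warningCount;
  }
  return counts; // e.g. { 'team-a': { errors: 12, warnings: 3 }, ... }
}
```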
Here's a drill-down view of a single lint rule and the files that have this particular lint failure. In this case, it's no-unused-vars.
We use Checkup, a tool built by LinkedIn as well, every night. It's a Node task runner that will run through ESLint and a bunch of other tooling, and then you can format the ESLint output to a particular format. In our case, there's a Checkup formatter for ESLint that translates it to SARIF, which is a standard format. It's not super digestible when you have thousands of failures, but with few failures, you can click into the file and it'll jump to the particular line of code in your file. This is an example of a SARIF file. Again, your IDE usually has a SARIF-friendly viewer to get you the jump-and-click functionality.
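If you want to reproduce that pipeline outside Checkup, a hedged sketch using ESLint's formatter API and a public SARIF formatter package might look like this (the talk used Checkup's own formatter, so this is an approximation):

```javascript
// Sketch: produce a SARIF file from ESLint results.
// Assumes the @microsoft/eslint-formatter-sarif package is installed.
const fs = require('fs');
const { ESLint } = require('eslint');

async function writeSarif() {
  const eslint = new ESLint();
  const results = await eslint.lintFiles(['app/**/*.js']);
  const formatter = await eslint.loadFormatter('@microsoft/eslint-formatter-sarif');
  fs.writeFileSync('lint-results.sarif', await formatter.format(results));
}
```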
After that, once we have the SARIF file, we can then create a dashboard. We ended up migrating the dashboard to the company's documentation stack, I believe this is Docusaurus, just because someone's VM environment kept getting shut off and it was hard to share ownership of it if someone was on vacation or something like that. So yeah, a much prettier UI, easier to use, but it's the same data as the original SARIF version of the dashboard.
And then we had a weekly scorecard that we would email to all the teams, where for every team we gave them a green, yellow, or red. Green if you had a negative week-over-week change in lint failures, meaning you've decreased the number of lint failures in your team's ownership, or you're already at zero for your team. You get yellow if there's been no increase or decrease in lint failures, so nothing really happened that week. And then red if there's a positive increase week over week. What we've seen is that red usually happens when some code changed ownership, or something was incorrectly updated in the configs or the dependencies or something like that. All of these are really good to know early on, rather than someone figuring out later, when they're trying to fix their lint failures, that there are a lot of failures they've never even seen before.
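The coloring logic itself is simple; here's a sketch of the week-over-week comparison as I understand it from the talk (illustrative, not the actual internal code):

```javascript
// Sketch of the weekly scorecard color per team.
function scorecardColor(lastWeekFailures, thisWeekFailures) {
  if (thisWeekFailures === 0 || thisWeekFailures < lastWeekFailures) {
    return 'green'; // burned failures down, or already clean
  }
  if (thisWeekFailures === lastWeekFailures) {
    return 'yellow'; // no movement this week
  }
  return 'red'; // week-over-week increase: often an ownership or config change
}
```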
So actually, it's been quite useful to have these reds, although it was never a rogue engineer who just pushed in all these lint failures. This is a sample of a scorecard that we would send out, with the failure count and the green, yellow, or red coloring per team. This was a manual email. We could have automated it for sure, but this was not something we wanted to invest so much time in. What we did invest time in is identifying maybe two or three engineers who fixed big issues that week and giving them a personalized shout-out and a link to their PR, to inspire other people and also acknowledge that they went above and beyond their duties and helped out the overall community.
But at the end of it, while we were trying to burn down the number of lint failures, what we ended up seeing was the 80/20 rule playing out in real life. The first 80% was quite fast, but the last 20%, the last mile, was really, really hard to get to zero. There were a lot of stragglers. At the end of it, we made it a team effort. We asked other teams for help, and also, at the end of it, our team of volunteers in the task force ourselves just took on these files and cleaned them up. You might ask, does it really matter? You got 80% down, that seems pretty good. And the last mile was really long. If you look at this chart, almost half of the chart is the 80%, and the last mile is another, maybe, almost a year, right? I think it's another year.
Granted, I did go on paternity leave around July 2022, which is right in the middle of this, so I was not able to bug the teams as much or run some of these emails, but it still took a long time and a lot of hard work to get through the last mile. But yeah, the last mile does matter. Sustainability is key. If you see a file in your repository like this, it's much easier for someone to be like, oh, it's fine for me to add a lint failure. Or if they're touching this file, they're going to just be like, oh, I'm just going to override, because there are a lot of lint failures that I didn't cause and that were just here. Compare a file like this to a file where there are no lint failures and everything's good. It's like having code that's already tested, right? It's much easier to add a new test than to add in all the test scaffolding for your existing component. Yeah. So for us, what we've seen is that sustainability, making sure we stay at zero, meant that everyone actually had to have zero lint failures.
So in summary, for the road to zero lint failures: we had a rule of no new errors, so when enabling lint rules as errors, there should be no new failures. We had a single owner per file, so we knew which team was accountable for each file. We used a cron job every night, using a tool called Checkup, to calculate how many failures each team had; we also did that for warnings and disables, et cetera. We had a homegrown dashboard identifying the issues, with a quick link to the GitHub line itself, which was invaluable for engineers who had, like, five minutes in their day and wanted to quickly bang out a PR fixing something; they could quickly identify that. We did a weekly scorecard to keep teams and individuals accountable, and also to congratulate teams and individuals and foster some healthy competition as well.
Recognition, same as what I was saying: it's really important that it's not a chore, it's more of a team sport, where we as a community try to figure out how to get a clean campsite that everyone enjoys. It's much easier to work in. There's no unexpected work that arises when you open a file and there are a lot of failures. It's much more maintainable, too. And finally, get your hands dirty. As a task force, which is volunteer-run, we went through different teams and just tried to fix their failures for them as well if they were having a hard time, because someone has to do it, and as someone leading the effort, you should definitely try to fix it if no one else is trying to fix it.
So, for the results of our initiative: we learned that individual developer experience was very important. The actual engineer working on the fixes should have all the attention, not some sort of 100,000-foot view of the landscape; the per-individual view is very important. The same thing for shout-outs; very individualized shout-outs were important. And no regressions was also very important, because these regressions keep popping up, and having some sort of table was very useful for that. So between November 2021 and January 2023, we removed close to 6,000 lint failures. It took a little bit over a year to complete, with 55 engineers contributing to the effort. We saw a 30% increase in satisfaction in our quarterly surveys for code quality, or perceived code quality, I guess, for individuals working on the repository, which was amazing. This was also coupled with continued changes in the linting ecosystem. We've had over 80 rule changes or dependency changes in the past year, I believe, and over 45 unique contributors as well.
So the learning process has still kept going, even though we're trying to get the number of failures down to zero. And if you remember our quote from 2017, if it's not in the docs, it's not a real thing. I think our new quote nowadays is that if it's not a lint rule, then it's not a real thing. And so, thank you for attending my talk. Feel free to reach out if you have any questions or comments. Thank you.