Transcript
This transcript was autogenerated. To make changes, submit a PR.
Welcome to our talk about Code Insight.
You know who we are from our introductions, but let
me introduce Form3 as well. The name Form3
is derived from third-generation cloud technology and a
shortened version of the word platform: Form3.
We have some great customers, which you can see on this slide,
who choose our payment platform as their payment processing
solution. Form3 was founded in 2016 and
we've been growing and scaling ever since. We currently have
around 260 employees, of which about 130
are engineers. We are a fully remote company, and
we're hiring. Today we will be learning about our
journey with code analysis at Form3, as well
as how we built our custom tool, Code Insight.
We will begin with an introduction to engineering at Form3.
This will set the stage for how we deliver code at Form3
and what challenges our teams face.
Next, we discuss the requirements we needed from our
custom tool, followed by the architecture we have
implemented as part of the Code Insight project.
Finally, we round off our journey by discussing its
adoption across our teams and some insights and lessons learned
from the project. We have a lot of ground to cover,
so let's get started. Let's begin with a
discussion of engineering at Form3. As I
mentioned in the intro, Form3 has been around since
2016 and we've been growing ever since.
The way we deliver software has evolved together with our
organization. We now have over 500 repositories
in different languages such as Terraform, Go,
Java, and YAML, which our developers contribute
to at different frequencies. Some of our repos are under active
development and some are not actively maintained, but all are
important to our platform. We are a rapidly
growing engineering organization. This means that we
have engineers who are quite new to the code base,
contributing to our live services. They need support
and quick feedback on the code they deliver.
Our platform is compliant with the highest standards of
security and should be actively maintained to remain free
from vulnerabilities. We use Travis CI as our
continuous integration tool, and all of our code checks integrate
with it. But most importantly,
code ownership and the DevSecOps mindset are at the
heart of everything we do at Form3. We firmly
believe that security must also play an integrated role in
the full lifecycle of our apps in order to take
advantage of the agility of DevOps.
Our code analysis tools should make it easy for teams
to deliver secure solutions. The first static
analysis tool we used was Salus, open sourced
by Coinbase. It gave us a consistent set of
static analysis tools across all of our repos.
All the analysis tools were bundled into one Docker container.
The Docker container was then downloaded, the source mounted,
and the scans run from the container for each build.
The solution worked well for a while, but the containers
became sluggish and heavy. Afterwards,
we reimplemented our own lightweight solution called
SecScan as a replacement for Salus.
It was designed to wrap multiple static analysis
tools and to allow our toolchain to be easily tweaked
and reused across multiple repos.
Checks were run in a Docker container with a configured token,
which meant that each service needed its own token
injected. Scans were configured
for each repo via a Makefile or a Travis YAML
file. SecScan brought us two big advantages:
standardized scanning across all repos, and a single place
to maintain our scanning tools.
However, while SecScan delivered on its promise
of a more lightweight scanning tool, it introduced
some other problems which were accentuated by our growing
engineering team. First, there was no enforcement.
SecScan had to be manually configured on every new repo
and could be made optional. Second,
we had no visibility of which repos were
configured to use it and which repos were failing
their scans. Furthermore, SecScan only ran
on PRs, so the repos that were no longer under active development
were never scanned. And finally, it was
difficult to manage adoption and updates as the
config was spread out across every repo. All of them needed
changing once updates were rolled out.
As we continued to grow and have more services and repositories,
it became necessary to move to a new solution.
Some key requirements for the new code scanning solution,
Code Insight, were identified by the InfoSec
team. We needed a centralized source code scanning
solution that could integrate well with our development workflows.
A central configuration makes it easy to maintain
and change system-wide. The custom Code
Insight tool should provide metrics for scanned repositories,
even those that are not being actively contributed to.
It is very scary to have repositories that are not scanned
and could accumulate new vulnerabilities over time.
The custom Code Insight tool should be easy to add and enforce for
new repositories. It should also provide a workaround to stop
it from blocking emergency releases.
With all this in mind, the team kicked off the new
Code Insight project to address some of the problems that we were seeing
with SecScan. The project has been the work
of many of our amazing engineers, whom you can see on
this slide. I will now hand over to one of these
awesome contributors and my co-presenter, Ross,
who will walk you through the rest of our exciting journey with Code Insight.
Thanks, Adelina. Right, let's take a look at how Code
Insight scans one of our repositories. Code
Insight is installed as a GitHub App on each of our
repos, and when a pull request is created, our GitHub App
is notified by a webhook from GitHub. On
our side, that looks like a Lambda function behind an API
Gateway. So this first Lambda
function fetches the code from GitHub and calls GitHub back
to say that we're going to be performing some scans on that PR.
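To make that concrete, here is a minimal sketch of what such a webhook-receiving Lambda might look like in Go. The queue environment variable, the message shape, and the names used are illustrative assumptions for this write-up rather than Form3's actual code.

package main

import (
	"context"
	"encoding/json"
	"os"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

// pullRequestEvent captures just the parts of GitHub's pull_request webhook
// payload that this sketch needs.
type pullRequestEvent struct {
	PullRequest struct {
		Head struct {
			SHA string `json:"sha"`
		} `json:"head"`
	} `json:"pull_request"`
	Repository struct {
		FullName string `json:"full_name"`
	} `json:"repository"`
}

func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	var evt pullRequestEvent
	if err := json.Unmarshal([]byte(req.Body), &evt); err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 400}, nil
	}

	// The real function also fetches the code, tells GitHub that checks are
	// in progress, and inspects the repo to decide which scans apply; here we
	// simply enqueue a scan request.
	body, _ := json.Marshal(map[string]string{
		"repo":   evt.Repository.FullName,
		"commit": evt.PullRequest.Head.SHA,
	})
	svc := sqs.New(session.Must(session.NewSession()))
	_, err := svc.SendMessageWithContext(ctx, &sqs.SendMessageInput{
		QueueUrl:    aws.String(os.Getenv("SCAN_REQUEST_QUEUE_URL")), // assumed env var
		MessageBody: aws.String(string(body)),
	})
	if err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 500}, err
	}
	return events.APIGatewayProxyResponse{StatusCode: 202}, nil
}

func main() { lambda.Start(handler) }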
That first Lambda function also looks at the content of the code and uses
that to decide which scans need to be performed, and then
it puts a message onto a request queue to say please
go and run these scans. That request queue
is then consumed by an orchestrator Lambda, and that
records some details about the scan, and then creates
a message on the task-pending queue, which gets picked up by the
scheduler Lambda. Now, the scheduler Lambda actually runs
the scan as a task on Amazon's Elastic
Container Service, and for this we're using a serverless
Fargate cluster. Now,
inside Fargate, each of our scans runs
multiple containers in a single task, so we
wanted to keep the scan container itself really simple so it's
easy to add new ones. So all the peripheral jobs have been
pushed out to other containers. The first container
clones a repository and writes it to disk, so that
the scan container can just read it from disk and write its results
to disk. Once that's complete, two more containers
kick in to process those results. One of them writes
comments back to GitHub, which we'll see later, and another takes the
results and puts them in an S3 bucket for persistence.
Finally, once that's all done, a notification container takes
over, which writes a message on a queue to say that the task is complete.
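As a rough illustration of the contract between these components, the messages on the request and task-complete queues might look something like the Go structs below; the exact field names are assumptions made for this example.

package messages

import "time"

// ScanRequest is the kind of message the first Lambda might place on the
// request queue to ask for scans to be run.
type ScanRequest struct {
	Repo      string    `json:"repo"`
	Commit    string    `json:"commit"`
	Scans     []string  `json:"scans"` // which scans apply to this repo
	Requested time.Time `json:"requested"`
}

// TaskComplete is the kind of message the notification container might write
// once the containers in a Fargate task have finished.
type TaskComplete struct {
	Repo       string    `json:"repo"`
	Commit     string    `json:"commit"`
	Scan       string    `json:"scan"`
	Passed     bool      `json:"passed"`
	ResultsKey string    `json:"results_key"` // S3 object holding the full output
	Finished   time.Time `json:"finished"`
}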
So when that task-complete message is consumed
by the orchestrator, it can update the records and emit
an event for the notifier, which in turn will update
GitHub. Now, once all of the scans are complete,
the notifier can set the final status of the check on the PR.
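For a flavour of that final notification step, here is a rough sketch using the go-github client. The talk describes a GitHub App check; a plain commit status is used here as a simpler stand-in, and the token handling and the "code-insight" context name are assumptions.

package notifier

import (
	"context"

	"github.com/google/go-github/v53/github"
	"golang.org/x/oauth2"
)

// setFinalStatus reports the overall Code Insight result against a commit.
func setFinalStatus(ctx context.Context, token, owner, repo, sha string, allPassed bool) error {
	tc := oauth2.NewClient(ctx, oauth2.StaticTokenSource(&oauth2.Token{AccessToken: token}))
	client := github.NewClient(tc)

	state, desc := "success", "All Code Insight scans passed"
	if !allPassed {
		state, desc = "failure", "One or more Code Insight scans failed"
	}
	_, _, err := client.Repositories.CreateStatus(ctx, owner, repo, sha, &github.RepoStatus{
		State:       github.String(state),
		Context:     github.String("code-insight"),
		Description: github.String(desc),
		// TargetURL would link back to the scan in the Code Insight UI.
	})
	return err
}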
So we run this whole process for every PR,
every time there's a change coming into one of our repositories. But we
also scan all of our repositories' default branches
nightly, meaning that even the quietest project has
its code scanned every 24 hours. If one of those default-branch
scans fails, we'll notify the owning team over Slack.
So what does this look like to one of our engineers? Well, in the GitHub
check information, they'll be able to see each of the scans that
was performed, the result, and a link to the
scan in the Code Insight UI. In some cases we
want to allow a build to pass even if a check has failed,
and those are labeled here as soft failures. So this
can be really handy for existing repositories where the team is still working through
some issues, or when we're experimenting with new scans and we
don't want to risk breaking everybody's builds.
If a user clicks through, they'll be able to see details of the scans.
So this is the full suite, each of the scans that have been performed,
and they can click through and see the full content of the logs from
the S3 bucket we saw before. But really
the best user interface for Code Insight is within GitHub.
So my colleague Adam recently finished off the feature so that
engineers will get fast feedback as comments directly on their
PR, right alongside the offending line of code.
So to do this, we need the scan output to be consistent,
and we're adapting our scans to use the SARIF standard
wherever possible.
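As a small illustration of why a common output format helps, here is a minimal sketch of reading a SARIF file in Go. The file name and the handful of fields picked out are assumptions; they are just enough to place a comment next to the offending line.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// sarifLog models a tiny slice of the SARIF 2.1.0 schema.
type sarifLog struct {
	Runs []struct {
		Results []struct {
			RuleID  string `json:"ruleId"`
			Message struct {
				Text string `json:"text"`
			} `json:"message"`
			Locations []struct {
				PhysicalLocation struct {
					ArtifactLocation struct {
						URI string `json:"uri"`
					} `json:"artifactLocation"`
					Region struct {
						StartLine int `json:"startLine"`
					} `json:"region"`
				} `json:"physicalLocation"`
			} `json:"locations"`
		} `json:"results"`
	} `json:"runs"`
}

func main() {
	raw, err := os.ReadFile("results.sarif") // assumed path written by the scan container
	if err != nil {
		panic(err)
	}
	var doc sarifLog
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}
	// Each result carries the file, line, and message needed for a PR comment.
	for _, run := range doc.Runs {
		for _, r := range run.Results {
			if len(r.Locations) == 0 {
				continue
			}
			loc := r.Locations[0].PhysicalLocation
			fmt.Printf("%s:%d %s (%s)\n",
				loc.ArtifactLocation.URI, loc.Region.StartLine, r.Message.Text, r.RuleID)
		}
	}
}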
This shift from SecScan to Code Insight has given us
a few real benefits. Centralizing the config keeps the honest
people honest. So now switching a scan off requires
a pull request into a central repository, which the information
security team will review. It also prevents
an attacker from introducing some malicious code in a PR
and also disabling the check that would look for that malicious code in the
same PR. And also,
this infrastructure scales to meet our demands.
So it's a fairly diurnal flow. We
see engineers creating a load of PRs during the day, and it tends
to be quite quiet overnight, apart from the nightly builds.
And the infrastructure can scale to adapt to that.
It's mostly serverless, although I should mention this might change over
time, because we've just hired our first Canadian employee, with more to
come, so we could see changes in this pattern. But we think that
the infrastructure is going to scale well for that.
So how did we actually bring this in at Form3?
We didn't just drop this in and break everybody's builds on day one.
So for existing repositories, we started with soft
failures at first, so scans would just report failures
for information but not block builds. And we took
a ratcheting approach to things. So once a repository was
passing, we would then enforce it, and we wouldn't allow it to go
back. For new repositories,
once we were content that this worked, we enforced Code
Insight everywhere on all new repositories.
So people are getting that out of the box by default.
And we started to gather metrics recently so that we
can assess vulnerabilities that are raised by Code Insight scans,
but also so that we can check Code Insight's performance and make
sure it's not getting in people's way. But for those existing
repositories, we needed to drive adoption across teams.
We needed to encourage our engineering teams to fix the issues
and get their builds going green. And we did this in
a few different ways. So we created some batch
PRs for issues that affected multiple teams.
So where we found something that we could fix in multiple places, we would
use a tool, something like Turbolift, to be able to create
PRs across multiple repositories. And we could offer them up to
the feature teams to review and then pull them into their code base to fix
an issue. We arranged some mob sessions,
so we got people from across the engineering team
together to work on improving Code Insight coverage.
And also there was a feature of Code Insight that was meant to
introduce some gamification, which is the team leaderboard.
So this is a view of our team leaderboard. But I have to be honest,
this has not been terribly successful in driving adoption. These
metrics unfairly reward teams who write very little code
and stigmatize those with lots
of repositories to manage. So on the whole, I don't think gamification is
right for this sort of thing. Engineers can just see right through it.
So that's one of the things that hasn't worked terribly well. There are
a few others. So,
firstly, it can be tempting to think that because this
system isn't processing billions of pounds' worth of payments
like our other code, it's not critical.
But if Code Insight stops working, or if somebody introduces a scan
that just fails everything, then we can bring the whole company's
work to a screeching halt, as I found.
So we have to treat it like production infrastructure. And to that end,
we've introduced ways of canarying new versions of scans,
and we're also adding a better test environment and better monitoring.
The next thing is that we've had a few chunky repositories,
particularly with Terraform code, that have proved to be
a stumbling block for adoption. It's much easier to get a small
code base passing and then improve coverage incrementally than
it is to wrangle some sprawling repository. So where you can,
splitting out repositories into smaller units can help.
Finally, many of the tools we are using to scan our code
are based on databases of vulnerabilities that are held
externally. So a minor vulnerability added to one of those databases
can cause a lot of our builds to break suddenly if we're not careful.
So to address some of these issues, we've got some
upcoming features. So we want to be
able to distinguish between new and existing issues so that
we can target feedback to users. This is going to help
in those legacy code bases where the existing
issues can make it hard to spot new ones being introduced.
We're also going to do more with metrics, so for example, spotting flakiness
of scans or some scans that take too long to run.
And finally, we want to introduce the concepts of age and severity
into Code Insight so that we only hard-fail a build
if it violates our policies on remediation.
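To give a flavour of what such a policy might look like, here is a toy sketch in Go. The severity levels and remediation windows are invented for illustration and are not Form3's actual policy.

package policy

import "time"

type Severity string

const (
	Critical Severity = "critical"
	High     Severity = "high"
	Medium   Severity = "medium"
	Low      Severity = "low"
)

// remediationWindow is how long an issue of a given severity may stay open
// before it starts to hard-fail builds (illustrative values only).
var remediationWindow = map[Severity]time.Duration{
	Critical: 24 * time.Hour,
	High:     7 * 24 * time.Hour,
	Medium:   30 * 24 * time.Hour,
	Low:      90 * 24 * time.Hour,
}

// Issue is a single finding carrying the age and severity information the
// talk mentions wanting to track.
type Issue struct {
	Severity  Severity
	FirstSeen time.Time
}

// shouldHardFail returns true only when an issue has outlived its remediation
// window; newer issues would surface as soft failures instead.
func shouldHardFail(i Issue, now time.Time) bool {
	window, ok := remediationWindow[i.Severity]
	if !ok {
		return false
	}
	return now.Sub(i.FirstSeen) > window
}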
So to wrap it up then, Code Insight has allowed us
to streamline the way we do application security at Form3.
We're running scans on every pull request and nightly
on every repository, so that even the quietest project is being
checked for vulnerabilities regularly. And for our engineers,
it's easy to interact with Code Insight just as part of their normal
GitHub workflow. So having Code Insight comment on your PR is
like having an eagle-eyed, slightly pedantic colleague
looking over your shoulder all the time, which is pretty helpful.
On the other hand, gamification didn't add much for us.
If you're going to try it, make sure that the metrics driving rewards
are fair and meaningful to you.
So I think a big part of the success of Code Insight is the central
management and configuration of it, which raises our overall compliance
and allows us to drive improvements in the tool. It's dead
easy for us to add new scans, and we can take comfort
that every piece of our code is being looked at continuously.
And with that, I'd like to say thanks for listening,
and thank you to my co-presenter, Adelina. We look forward to
getting your questions on Discord, and please don't hesitate to reach out
to us on Twitter. It's always nice to make new friends.
Please take a look at the Form3 engineering site.
Once you've whetted your appetite there, I'm sure you'll want to visit the careers
site and become one of our colleagues. And also check out the
excellent Form3 tech podcast run by my colleague Kev.
There are some excellent guests on there.
Thanks again for listening.