Transcript
This transcript was autogenerated. To make changes, submit a PR.
My name is Ben Conrad and I have the job title of head of product
at MDTP, which is the tax platform at HMRC,
and we're going to spend a little bit of time today explaining what that
is. I'm also joined here with Gerald.
Hi, I'm an appsec snooper. Ben's brought me along
to the platform to turn over stones, pull on strings and talk
in cliches. So you might be thinking that
this talk is going to be incredibly niche: how
we try and make sure that Scala services running
at HMRC are secure. And in some ways
it is, but we're intending to broaden it out and cover
things that you can use elsewhere.
So Ben, we're here to talk about securing the multi
channel digital tax platform. To set some
context, what is it? The headline is that it is
a PaaS, a platform as a service. In an effort to reduce
their postage costs and save on the brown envelopes that HMRC likes
to send out, HMRC have been building digital services in
line with the approach defined by the UK Government Digital Service,
which is broadly digital by default.
And to make this easier, HMRC have what
we call a multi channel digital tax platform.
MDTP, or just the tax platform. The platform exists
to make building and hosting digital services as
easy as possible. MDTP is a platform as a service,
as I say, and it's where the infrastructure, logging, metrics,
alerting, CI/CD pipelines, testing and prototyping templates,
everything that you need to build and develop a digital service, is provided
out of the box. And really importantly, nearly all
of it is self service. So for me, one of the highlights working
with you over the years was that relatively recently you told
me that MDTP had a vision statement.
Now normally I associate vision statements with verbal
gymnastics to make a company sound like everything to
everyone without being offensive to anyone, which then gets
used to align people on mandatory fun days.
But this one I really liked.
Simple, secure services for all. Let's go away
and try to understand what that means. There's so much that you can get
out of this simple statement, probably too much to go in
here, but what does it mean for you?
MDTP has been in existence for the last eight,
nine, maybe ten years, and we host nearly
all of HMRC's customer facing digital services.
As a platform, we provide this set of infrastructure and
tools to allow people, developers to build, test and deploy
services. But they have to be written in a certain way. They have to be
written in Scala with the Play framework. And we think we're
pretty good at what we offer. The platform is hosted in AWS,
but the platform abstracts AWS services so that developers
writing services to run on MDTP do not need
any AWS credentials. And that's a really important point.
They are not writing services to run in AWS
or any other cloud provider. They are writing services to
run on MDTP. And we could move the whole of
MDTP to a different cloud provider. And although that would be a
lot of work, the services running on MDTP hopefully
wouldn't have to make any changes. And in fact, we have managed to do that
in the past. We talk about MDTP being an opinionated
platform. The opinions we hold define
this paved road, this golden path, the bowling
alley of success. And that's what we provide to our users.
And the intention is that following
the paved road will allow the teams to
build services quickly and efficiently.
What we're really trying to do is remove the complexity
and in some ways remove a lot of the choices about which
technology to use. And the payoff for this is that if you follow our
opinions and you stay within those guardrails, then you can focus on solving
business problems and deliver value to HMRC and to
your users really quickly. Now,
this talk is about security, the secure services
bit of the statement, and specifically appsec.
Yes, appsec. It is, I think, a little bit contentious
that we have a platform security team and an
application security team, and maybe we'll touch on why that split exists.
Application security, which, as I say, we do differentiate from our
platform security team, who focus on the infrastructure.
We've always taken responsibility for the securing of
the platform itself, the infrastructure, the features that we build.
However, I guess it should be clear that if you're just concentrating on
the infrastructure, it's only one side of the coin.
The platform itself can be hardened and you can consider it to be
relatively secure, but that doesn't really count for a lot if the applications
that we're hosting are riddled with vulnerabilities that are easy to exploit.
I guess you'll forgive the analogy. If you make really
thick, strong walls that will withstand all sorts of attacks,
it's not actually very useful if the windows and doors are left wide
open; it's not going to provide that high level of
protection. The services hosted on MDTP have
always, and I hope always will have responsibility for
their own security. However, because of the
consistency and the way that all of these services
are built using common tools and common technologies,
we're able to effectively look for vulnerabilities not just in
a single service, but across hundreds of services at a time.
And we can also provide tooling that enables teams to proactively
check for known vulnerabilities in their own code as part of an automated
CI/CD pipeline. It's also important to remember that security
isn't a goal in its own right. It's something that always needs
to be looked at in context. It's no good throwing lots
of security tooling at the problem and then giving yourself a pat on
the back and a tick in a box. We process payments
of hundreds of billions of pounds a year, and we
legitimately pay out many billions, even in years
without a global viral pandemic. The applications
on the platform process the data for around 45 million individual
UK taxpayers and about 5 million companies.
And that data in itself is really valuable, and the UK
government has got legal responsibilities to protect it.
So the Appsec team are focused on looking at the security
of the applications that we host. I guess this is: who
are we worried about, who are the threat actors?
And in some ways this isn't that important,
in that a number of different threat actors may
actually be looking to exploit the same vulnerabilities.
But it's always useful to sort of know your enemy,
I suppose. So when
we are doing a risk assessment,
who are the threat actors that we're looking at? Rogue engineers are
something that we know we could have.
It's worth mentioning. It's not that we don't trust our engineers,
but it's worth remembering that people can have their credentials stolen,
they could be blackmailed, they could be exploited.
Script kiddies. So any
sufficiently large system is going to be under attack.
And we're not a WordPress site, but we get
lots and lots and lots of requests which seem to believe that maybe
we are, because people are just trying
anything. It's so cheap to do that.
Fraudsters, the garden variety fraudsters.
It's important, I think, that security isn't a technical thing,
or not solely technical. You're also thinking about
how an information system can be abused to trick people out of
money. So there's been a spate at
the moment of people receiving
text messages or communications which then convince them to hand over
their details so that fraudsters can claim tax repayments on their behalf,
often without the victim realizing
that anything has happened.
Hackers. So it's important to remember that
very often attackers will go for the low hanging fruit.
It's not so important that you have an unbreakable lock
on your door, but you do need to remember to lock it,
and you need to make sure that you are taking advantage of the
tools that you do have to secure your systems.
And finally, nation states. I think this is
the one that's most difficult,
and it may sound like overkill to think that a nation state is going to attack us,
but actually we are a UK government organization
and we can't ignore the possibility.
So, onto a bit more about the platform. We have
a microservice architecture with a lot of services. There are over 1,000
microservices. The numbers fluctuate a bit. I think there have
been around 200 new microservices created on MDTP so
far this year, but not all of those will be running in production yet.
And sometimes we get to decommission old services if they get replaced or
are no longer
needed. How you count teams is quite difficult,
because quite often there is a one to one relationship between a team
and a single service. But we also have live service
teams who may look after 50 or so different services,
and there are plenty that fall between those two extremes. And of
course, the teams vary in size as well. So in total,
it depends how you count them. I think there are about 340
front end microservices on the platform.
So there are a large number of digital services, which I
think really speaks to how inventive
this country is at coming up with new taxes. The point
is that we're operating at significant scale and quite
a lot of changes to code. On this chart you
can clearly see Christmas and to a lesser
extent, Easter. And I think the last one is the
late Queen's Jubilee, which was a lot of fun
but no good for productivity. And each of these
lines of code could be built into a new artifact, and then those will
be tested through our pipelines. If a test fails, then the pipeline will
fail and the artifact won't progress any further. This does create a challenge
for HMRC, though, because things are constantly changing on the platform,
and we want to know that we're not introducing security holes with those
changes. But just before I move on, just to be really clear,
the number of changes is not in itself a security problem.
Indeed, it's very, very much the opposite. If we implemented a
change freeze and set the whole platform in aspic for the next year,
we would become far more vulnerable to security incidents,
not less. A portion of these deployments will be
to upgrade code to remove older versions with known security risks.
But all of these changes will be improvements to services that HMRC
make available. And the higher the numbers, the better.
I think that's quite enough context, and I hope I've not bored you all
too much. So the question we've got remaining is
how can we protect ourselves? Trust, but verify.
I used to love using the Russian doveryai, no proveryai,
but I guess that's no longer cool.
So we've got a number of different problems.
Let's start with one, which is that we have all of these
different microservices and they all have dependencies. And so
that means that we have lots of dependencies, although not
quite as many as there might be, because we have
opinions and because we make sure
that everything is written in Scala with the Play framework.
But we may well have unsupported or vulnerable code
running across our many services. So what can we do
about that? So the first sort of step
at doing that is to introduce something called Bobby rules.
Bobby is a tool that we've written that is used as part of the
builds, and it fails if there are any dependencies
that we don't like. So we can manually say don't use that.
If you use it, you can't build. It's quite
a severe tool,
which is why we sort of tend to be quite careful in using
it. We tend to announce to people that things
will be deprecated and give them some time to
update things. Because if we were to just say
one day, oh, you can't use this, then we'll probably get inundated with
calls saying oh no, we need an exemption because we have got special
circumstances. It's a great tool as well,
not only for preventing vulnerable
libraries, it's also great for enforcing
platform upgrades, and we get quite a lot of reporting
on it. We can see trends as to whether people
actually look at it. One thing that we have done is
we made it so that it can be bypassed, so if somebody needs
to bypass a fix for something unrelated,
then they have to go and ask Ben whether they can do
it. And the screen here is a screenshot from the Catalogue,
so actually it's worth talking about the Catalogue briefly.
It's an internal tool that we've developed in house, but it's now possible to use
more generic alternatives off the shelf. For us,
it's something of a Swiss Army knife. It holds a vast trove of
information about the application. Nearly all of that is
automatically generated. So there aren't manual updates required to anything here.
And we use it to basically keep an eye on the services, to make
sure that they're all doing the right thing,
and we use it to collaborate with the teams,
so that they can upgrade things with the least friction.
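To make the Bobby idea concrete, the core check is simple: compare each resolved dependency against a list of banned version ranges, with a grace period before the rule starts failing builds. This is only an illustrative sketch in Python; the real tool is a build plugin, and the rule fields, names and data here are assumptions, not MDTP's actual format.

```python
# Minimal sketch of a Bobby-style dependency rule check.
# Illustrative only: rule format, names and data are assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class BobbyRule:
    organisation: str
    name: str
    bad_before: tuple      # versions strictly below this are banned
    reason: str
    enforced_from: date    # grace period: rule only fails builds after this date

def parse_version(v: str) -> tuple:
    return tuple(int(p) for p in v.split("."))

def check(dependencies: dict, rules: list, today: date) -> list:
    """Return violation messages; an empty list means the build may proceed."""
    violations = []
    for rule in rules:
        key = f"{rule.organisation}:{rule.name}"
        if key in dependencies and parse_version(dependencies[key]) < rule.bad_before:
            if today >= rule.enforced_from:
                violations.append(f"{key} {dependencies[key]}: {rule.reason}")
    return violations

rules = [BobbyRule("com.example", "http-verbs", (14, 0, 0),
                   "versions below 14.0.0 have a known vulnerability",
                   date(2024, 6, 1))]
deps = {"com.example:http-verbs": "13.2.0"}
print(check(deps, rules, date(2024, 7, 1)))  # one violation -> build fails
```

The grace period is the part that matters: it is what lets you announce a deprecation and give teams time to upgrade before the build actually breaks.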
So here's a second problem that we've got. We are
big fans of coding in the open. It's really
important. It's one of the GDS standards. We believe it makes things better.
However, you do not want to be leaking secrets onto
the Internet. And that is something
that we know has happened before. So again, how can we stop
ourselves doing that? So what we've come up with
is what we termed a leak detection service.
And essentially it keeps tabs on the GitHub commits
that are being done. And when it finds something that looks sensitive,
it will alert the teams via Slack alerts, but it'll also
alert the security teams, so that we can look
at the bigger picture. Again, it's important that what we're doing
here is to collaborate with teams themselves, with the service
teams themselves, to help them to protect themselves.
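A leak detection pass of the kind described can start as little more than a set of regular expressions run over commit contents. This is a hedged sketch, not HMRC's actual service; the patterns and function names are illustrative assumptions.

```python
# Illustrative sketch of a leak-detection pass over commit contents.
# Patterns and names are assumptions, not HMRC's actual service.
import re

SECRET_PATTERNS = {
    "aws-access-key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key": re.compile(r"-----BEGIN (?:RSA )?PRIVATE KEY-----"),
    "password-assignment": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
}

def scan_commit(lines):
    """Return (rule_name, line_number) pairs for anything that looks sensitive."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((rule, lineno))
    return findings

commit = [
    'val dbUrl = "jdbc:postgresql://localhost/tax"',
    'password = "hunter2"',
]
print(scan_commit(commit))  # -> [('password-assignment', 2)]
```

In practice you would also scan commit history, entropy-check strings to catch tokens the regexes miss, and route findings to the owning team's Slack channel as described above.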
So I guess here's another problem. Again,
going back to vulnerabilities in dependencies maybe,
but maybe something a bit different. A new vulnerability
is found and the question is,
how do we know whether we are vulnerable to it?
So this is one of my favorite
tools that we've developed on the platform, which is called the Dependency Explorer.
It allows you to search through all the dependencies of all
the services. And this screenshot is why
the Log4Shell vulnerability that took place at Christmas 2021
was scary for only about ten minutes.
We got the notification that there was a problem in the
log4j-core library. We had a look and
we found that it wasn't used.
For anyone not aware, the Log4j vulnerability was quite
scary because, with specially crafted messages
being logged, it could trigger information leakage
and remote code execution.
The Dependency Explorer showed us that our services didn't
have the dependency, so that was great, but it requires
you to know what you're looking for.
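The heart of a dependency explorer is an inverted index from artifact to the services that depend on it, so the "are we running log4j-core anywhere?" question becomes a single lookup. A minimal sketch, with invented service names:

```python
# Sketch of the idea behind a dependency explorer: index every service's
# resolved dependencies, then answer "who uses artifact X?" instantly.
# Service names and data are made up for illustration.

def build_index(services):
    """Map artifact name -> set of services that depend on it."""
    index = {}
    for service, deps in services.items():
        for artifact in deps:
            index.setdefault(artifact, set()).add(service)
    return index

services = {
    "self-assessment-frontend": {"play-frontend-hmrc", "http-verbs"},
    "paye-api": {"http-verbs", "mongo-driver"},
}
index = build_index(services)

# The Log4Shell question, in miniature: which services pull in log4j-core?
print(index.get("log4j-core", set()))  # empty set -> not used anywhere
```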
So this is one of our newer
tools, which I know Gerald wants to talk about.
The problem we've got, I suppose, is that we do have an awful
lot of code with a lot of dependencies and we've got vulnerabilities
in some of them, but how do we know what we're vulnerable
to and what we can safely ignore?
So typically what happens here is
that you get some tooling in and do some dependency
analysis. In this case, we've actually got JFrog's Xray.
The problem we've found with JFrog was that the Xray
results screens were horrible. If you look at the screen there,
you can't really make out what's
being said here. The columns are too small and
we don't all have a bank of 42 inch monitors to
be able to look at these reports. And looking through 140,000
pages of reports, it's just impossible.
But the information that's contained in
Xray is obviously useful. But there is a problem
with that as well, isn't there? Yes. So every
vulnerability tends to have a CVSS score; that stands
for Common Vulnerability Scoring System. And very often,
a lot of tools use that score
as a risk score. So you
can set up policies that say if there's a risk of
higher than eight, then don't allow it. If it's less than eight,
it's not a problem. The problem with that approach is that it's
not a risk score. Now we've actually gone through
each of the CVEs that were flagged and we found that some of
the worst issues did not have the worst scores,
and some of the worst scores that we found were not an issue at all.
It always depends on context. And if there's one
thing you take away from this talk, please don't go away and define policies
that say anything less than eight can go through, because that isn't secure.
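A toy example of that point (the CVEs and details here are invented): a pure CVSS threshold can flag an unexploitable issue while waving through the one that actually matters in context.

```python
# Why "block anything with CVSS >= 8" is not a risk policy: CVSS measures
# severity in the abstract, not exploitability in *your* context.
# CVE IDs and details are invented for illustration.

cves = [
    {"id": "CVE-A", "cvss": 9.8, "reachable": False},  # vulnerable class never loaded
    {"id": "CVE-B", "cvss": 5.3, "reachable": True},   # sits on an unauthenticated endpoint
]

naive_blocklist = [c["id"] for c in cves if c["cvss"] >= 8.0]
triaged = [c["id"] for c in cves if c["reachable"]]

print(naive_blocklist)  # ['CVE-A'] -- noise: not actually exploitable here
print(triaged)          # ['CVE-B'] -- the one that matters, waved through by the naive policy
```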
What we did do is as a first step,
we evaluated all the dependencies and
looked at the vulnerabilities that were in them
and then sort of aggregated them. And we actually used spreadsheets.
So instead of 100,000 reports, we're looking at
the individual CVEs. And then based on that,
we looked at creating
a prototype and then we created tickets
to sort of say, well, how can we turn this into something
that can be consumed by each of the services?
We looked at the fact that we've got a
huge number of things to process
and we tried to be basically agile and sort
of created an MVP and then sort of took it
from there. To start off, it was just this very
simple three by three board where we laid out what are the things
that will be needed by different parts of the organization.
And then in less than a month, we had an MVP
that allowed us to look at the
problems, but more importantly allowed the service teams
themselves to check what
the vulnerability issues were.
And we provided assessments by the Appsec team so
that we didn't overload the service teams by saying,
now you've got hundreds and hundreds of different reports to look at. And I
think it looks better than Xray already. Okay,
so it's all well and good finding problems in somebody else's code,
but how can we find examples
of maybe bad code in our own?
Yeah, it's an interesting thing. And again, you can start really
simple here. I mean, we've created what we've termed the risk
ledger. It's just a set of spreadsheets that identify areas
where risky code could live. As an example, we know that sometimes
parsing XML can be problematic. So we've identified
all the places where XML parsing is happening.
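That kind of inventory can start as a simple pattern scan over source files. A sketch, with illustrative patterns and file paths (the real ledger, as described, is a set of spreadsheets):

```python
# A first cut at a risk ledger can be a simple grep: record every place a
# risky pattern (here, XML parsing) appears. Patterns and repo layout are
# illustrative assumptions.
import re

RISKY_PATTERNS = {
    "xml-parsing": re.compile(r"\b(?:scala\.xml\.XML|XMLLoader|SAXParser)\b"),
}

def ledger_entries(files):
    """files: mapping of path -> source text. Returns (risk, path) entries."""
    entries = []
    for path, source in files.items():
        for risk, pattern in RISKY_PATTERNS.items():
            if pattern.search(source):
                entries.append((risk, path))
    return sorted(entries)

repo = {
    "app/ImportController.scala": "val doc = scala.xml.XML.loadString(body)",
    "app/HealthCheck.scala": 'def ping = Ok("pong")',
}
print(ledger_entries(repo))  # -> [('xml-parsing', 'app/ImportController.scala')]
```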
We then use that to create a sort of risk ledger to sort
of say, okay, these are all the places that we want to check, which then
allows us to sort of find the different patterns of usage
that happen. Now again, you have to remember that this is an
issue of scale. It's possible for somebody to remember
this for 20 microservices but not for
1000. And then what we're doing is we're sort of taking this
risk ledger approach and now we're starting to build tooling around
it. Yeah, I think we've recognized for some time that our success,
our scale has created a problem, although it's also created opportunities.
Our platform security team were our first attempt to really make
security a first class citizen of MDTP.
We wanted to start looking at the security of those things that we had direct
ownership of. And in a way that was the easy part. Although I
think we can agree that like love, security is a journey, not a destination.
With application security, there are other challenges.
Services on the platform are regularly reviewed from a security
perspective, but not as often as we're making changes
to them. As the platform, we decided that
we could do more and that's where the idea of a platform based application security
team came from. The first remit is really to go and lift some rocks,
pull on some threads and see what the problems are, as Gerald mentioned at the
start, but then secondly, investigating what we can
do to fix those problems and preferably at a platform level,
at that scale level, so that we can protect all the services running
on MDTP and not require individual changes
across thousands of repositories. So I'd like to
borrow here from Team Topologies: the service
teams are the stream aligned teams, and Appsec can be considered as
an enabling team. Now if you look at this slide,
my first attempt looked quite different. It was sort of like a
hub and spoke with Appsec sat at the center, but I felt
it gave completely the wrong message. Service teams aren't
at the margins when it comes to talking about security.
Security issues are always about context, and it's the
service teams themselves that will have the context.
So I've tried to sort of visualize this here with a sort of
double-headed arrow, to say that Appsec can't function without
the service teams and the service teams can't do all the security
themselves. Right at the beginning I said that my title
is an appsec snooper.
It wouldn't be possible for me to review the security
of services at the scale of MDTP if we had 15
different languages; it wouldn't be possible for Appsec
to do something centrally if everyone did something different.
And so that's where it becomes possible for a central
Appsec team to sort of do some of the turning
over of stones that otherwise wouldn't be possible, that service teams
themselves wouldn't necessarily have the time for. Now, when Appsec finds
something, we get in touch with the owning team and we just talk
to them. It's not about blame, it's not about finger
pointing, it's about collaboration to make services
more secure. And the Catalogue always
allows us to find a Slack channel that we can go and talk to.
And the whole thing works both ways. When a service team
finds an issue, they can feed it back to the central Appsec
team, they can ask about best practices.
And this is a great example as to how security
works. It's about collaboration, it's about people, it's not about
tools. But for the collaboration to work, we do need those tools.
So this goes back to an important point that I think we've been trying to
make all the way through, which is the paved road. The opinionated
platform enables a certain amount of centralization when it
comes to security, but those are the same benefits
for developing services quickly. We know that we
can allow or enable services to be built really
quickly. We built some for Covid in about four weeks,
but the platform supports both that speed
and also security. So just
on the conclusions here, the paved road
is really important. It makes centralizing application
security possible, and the tooling is really important.
It allows the centralized teams to reach out to the service teams,
it allows these findings to be distributed in a self service
manner, so we're not just chasing people down. It also
really helps if you've got experienced engineers who know what
they're doing. Yes, securing a complex
system is very hard, but you don't have to do everything
at once. And if
you want to start with sort of creating an appsec team,
I would personally recommend just starting by collecting
that sort of threat intelligence. Find out
which service writes files or which talks to a particular sensitive back
end. Make a list, use a spreadsheet, aggregate it, script it, automate it,
scale it, be agile about it.
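That "make a list, aggregate it, script it" advice really can start this small (all the data here is invented): collect one fact per service, then answer questions from the aggregate.

```python
# The "make a list, aggregate it, script it" advice in miniature: record
# simple facts per service, then query them. All data is invented.
from collections import defaultdict

facts = [
    ("file-upload-frontend", "writes-files"),
    ("paye-api", "calls-sensitive-backend"),  # hypothetical sensitive backend
    ("paye-api", "writes-files"),
]

by_behaviour = defaultdict(set)
for service, behaviour in facts:
    by_behaviour[behaviour].add(service)

# "Which services write files?" -- answered from the aggregated list.
print(sorted(by_behaviour["writes-files"]))  # ['file-upload-frontend', 'paye-api']
```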
Thank you very much for coming to our talk. Thank you.
Take care.