Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello, everyone.
Welcome to my talk on building a network telemetry platform
to minimize security threats.
As part of this presentation, we'll dive into network telemetry, the need
for a platform, and what the platform needs in order to minimize security threats.
With that, let's get started.
First, we'll take a look at why we need network telemetry and the
motivation for building a platform.
Second, we'll look into the fundamental building blocks required
for creating such a platform.
Third, we'll use the blocks and create an architecture that will be able to ingest
billions of network events every day.
And fourth, once we have the data from the platform, we'll look at how
we can use it to minimize security threats.
Last but not least, we'll take a look at how we can extend our network
telemetry platform to other feature sets.
Why do we need network telemetry?
Gartner predicts that by 2028, 50 percent of enterprise
companies will be on the cloud.
With such rapid adoption, companies are also facing
increased cybersecurity risks.
Zero Trust, a well-known cloud security model, also emphasizes the need
for network segmentation so that during an attack,
the attack surface is minimized.
So how can we achieve this segmentation?
One way is to first understand the network topology of your organization.
And in order to understand network topology, that is where
network telemetry will help.
Second, threat detection and anomaly detection.
Imagine a threat actor has entered your infrastructure.
How will you detect that?
One way is by identifying a malicious or an unknown IP.
Next, let's say the threat actor is trying to access a bunch of different
resources in a short duration,
which is definitely an anomaly or an abnormality.
You can run machine learning algorithms on your network telemetry
data to identify even these.
So you have telemetry to identify unknown or malicious IPs as well as anomalies.
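As a rough illustration of the anomaly just described, here is a minimal sketch of a fan-out detector: it flags a source that talks to many distinct destinations within a short window. The event shape, IPs, window, and threshold are all illustrative assumptions, not from the talk; a real deployment would use proper ML or statistical baselining.

```python
from collections import defaultdict

# Hypothetical flow events: (timestamp_seconds, source_ip, destination_ip).
EVENTS = [
    (0, "10.0.0.5", "10.1.0.1"),
    (10, "10.0.0.5", "10.1.0.2"),
    (20, "10.0.0.5", "10.1.0.3"),
    (30, "10.0.0.5", "10.1.0.4"),
    (40, "10.0.0.5", "10.1.0.5"),
    (300, "10.0.0.7", "10.1.0.1"),
]

def flag_fan_out(events, window_seconds=60, max_distinct=3):
    """Flag sources touching more than max_distinct destinations
    within any window_seconds-long window (a crude fan-out check)."""
    flagged = set()
    by_source = defaultdict(list)
    for ts, src, dst in sorted(events):
        by_source[src].append((ts, dst))
    for src, hits in by_source.items():
        for ts, _ in hits:
            in_window = {d for t, d in hits if ts <= t < ts + window_seconds}
            if len(in_window) > max_distinct:
                flagged.add(src)
                break
    return flagged
```

Here, `10.0.0.5` reaches five destinations inside one minute and is flagged, while `10.0.0.7` is not.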
Lastly, compliance.
Imagine you are a network service provider or a service provider in the
cloud, and you are offering services to a ton of government agencies, or you
operate in national security regions.
Network telemetry is an important part of your audits.
Why do we need a platform?
Think of a platform as a framework that you're setting up so
that others can build on it.
So it is not just one solution, but an enabler for endless solutions.
Now that you have built a platform, if there are a bunch of scenarios that you
want to enable on it, it is easy for the service teams to onboard: all they
have to do is come to your platform, take the data you have, and run with it.
So you're saving a lot of cost and time for the company.
Third, you have also achieved consistency and standardization because
there is only one source of truth.
You now have consistency in the way resources are accessed, because you
have one single store where customers can come in and access the data, and
standardization because your schema and contracts are fixed, and any partner
teams that want to onboard have to work with that schema.
So what are the critical fundamental blocks in a network telemetry platform?
Like any other platform, you have services that are collecting data. Then you
have enrichment going on, which is responsible for adding meaningful context
to whatever you have. And then standardization, as we were discussing
previously: one single pane of glass experience for our customer teams.
Let's look at a typical architecture. On the left,
you see there are a bunch of VMs, and the VMs have something called
an agent running in the background.
This background agent runs at regular intervals, maybe every
one or two minutes, identifies all the ingress and egress
traffic, and logs that onto your system.
If it is a Linux-based system, you can use tcpdump to take the snapshot, and
if it is a Windows system, you can use WinDump.
Now, with all that information logged onto the disk, you have another agent
that is responsible for shipping the telemetry data from there to a
data store or a SIEM.
One critical thing to remember here is that it is better to think about the
schema design from the beginning.
What you see below is the schema that is emitted from the tcpdump:
source IP, source port, destination IP, destination port.
And then the VM ID is something you attach.
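To make the schema concrete, here is a minimal sketch of such an event record, with the VM ID attached at the agent. The field names (`src_ip`, `dst_port`, and so on) are assumptions for illustration; match them to whatever your capture tool actually emits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowEvent:
    source_ip: str
    source_port: int
    destination_ip: str
    destination_port: int
    vm_id: str  # attached by the agent, not part of the raw capture

def parse_event(raw: dict, vm_id: str) -> FlowEvent:
    """Build a typed event from a raw capture record and attach the VM ID."""
    return FlowEvent(
        source_ip=raw["src_ip"],
        source_port=int(raw["src_port"]),
        destination_ip=raw["dst_ip"],
        destination_port=int(raw["dst_port"]),
        vm_id=vm_id,
    )
```

Fixing a typed schema like this early is what makes the later standardization and onboarding story work.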
Now, with all this information available in the data store, one more thing to
emphasize is that these are billions of events.
You need a big data processing pipeline to capture all this data.
So you can use either streaming or a batch job to start processing the
data and add additional enrichments.
These enrichments can be anything from adding more VM metadata, like VM names,
the account the VM belongs to, or the VM SKU details, to mapping
the IP address, either the source or destination, to a particular IP range.
You generally have all the data stores hosted within a particular IP range,
and once you have a particular IP, determining which IP range
it belongs to will definitely smoothen your operations later.
Your big data or streaming jobs can also do that.
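The IP-to-range enrichment step can be sketched with the standard library's `ipaddress` module. The CIDR blocks and labels here are made up for illustration; in practice they would come from your network inventory.

```python
import ipaddress
from typing import Optional

# Hypothetical range-to-label map for illustration only.
KNOWN_RANGES = {
    ipaddress.ip_network("10.20.0.0/16"): "datastore-subnet",
    ipaddress.ip_network("10.30.0.0/16"): "vm-subnet",
}

def enrich_with_range(ip: str) -> Optional[str]:
    """Return the label of the first known range containing ip, else None."""
    addr = ipaddress.ip_address(ip)
    for network, label in KNOWN_RANGES.items():
        if addr in network:
            return label
    return None
```

A streaming or batch job would apply this lookup to every event's source and destination address before writing to the final store.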
And once you have enriched all this information, you can put this in a
final data store, which will become the source of truth for your consumers.
To reiterate, our platform architecture consists of collectors, which are your
agent and Fluentd agent gathering all the data
and sending it to a data store.
Then you have enrichment, which goes on in your big data processing
pipeline, where you add additional metadata, and
then you have your destination, where all the events are present
so that the consumers can use them.
Now, with that information available, what we have here is an alerting service
that is able to go through the telemetry data and alert the security teams
about any vulnerabilities or threats.
Let's look at how we can implement the network security itself.
First and foremost, for implementing network security, you need baselines.
At their core, baselines are the minimal security policies that every team
in the organization should follow.
These policies can range from mandating static analysis tools
and secret scanners in your build process to blocking high-risk ports
like 22 on your virtual machines, where SSH brute force attacks can happen.
Now let's dive into our network security architecture.
On the left is the data store that we created
previously using the platform.
Then you have your traffic analyzer, which reads data from the store and
evaluates this data against a baseline. The baseline can have rules.
Here is an example where we are saying this is conformant traffic:
the destination here, let's say, belongs to a SQL server.
Anyone communicating with the SQL server must have this IP
only, or must be from this VM only.
If there is any other communication going on to this destination,
that is not conformant,
and then we immediately alert the security teams.
To reiterate, our rule is basic, simply stating that any communication
to the destination, which is an ingress, should be only from source 1.2,
which can be a VM or maybe an app service.
Any other communications are non-conformant and will immediately be
taken care of by the security team.
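The rule just described can be sketched in a few lines: traffic to the protected destination is conformant only if it comes from the single allowed source. The IP addresses and rule shape are illustrative assumptions, not values from the talk.

```python
# Hypothetical values: the one VM / app service allowed to talk to
# the protected destination (e.g. the SQL server).
ALLOWED_SOURCES = {"10.0.1.2"}
PROTECTED_DESTINATION = "10.0.9.9"

def is_conformant(src_ip: str, dst_ip: str) -> bool:
    """Ingress to the protected destination must come from an allowed source."""
    if dst_ip != PROTECTED_DESTINATION:
        return True  # this rule only covers the protected destination
    return src_ip in ALLOWED_SOURCES

def non_conformant_flows(flows):
    """Return the (src, dst) pairs that should be alerted on."""
    return [f for f in flows if not is_conformant(f[0], f[1])]
```

The traffic analyzer would run checks like this over every event in the store and hand the non-conformant ones to the alerting service.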
Now, of course, this is a very simple scenario.
Let's take a look at a slightly more complicated one.
So here we have a bunch of data stores that are hosted in a
particular IP address range, 2.3 to 2.30,
and these data stores might be assigned an IP address from one of these ranges.
Now I go and tag this IP address range as DataStore.
Next, what I have is a VM that is trying to communicate with this IP address
range, which also has another tag, called VM for simplicity.
So your baseline rule can simply say: if it is ingress data,
the source should be VM and the destination should be a data store.
And if the destination is the data store, and if the source is not one
of the ones you're looking for, you can immediately mark
the data as non-compliant.
Actually, you can take it a step further: if the source or destination
is not a service tag, I immediately mark the data as non-compliant.
In this way, step by step, I can improve the overall security
posture of my organization.
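The tag-based rule can be sketched the same way, with the stricter variant that untagged IPs fail outright. The CIDR ranges and tag names here are illustrative assumptions.

```python
import ipaddress
from typing import Optional

# Illustrative tag map: any IP inside a range carries that service tag.
TAGGED_RANGES = [
    (ipaddress.ip_network("10.2.3.0/24"), "DataStore"),
    (ipaddress.ip_network("10.5.0.0/16"), "VM"),
]

def tag_of(ip: str) -> Optional[str]:
    addr = ipaddress.ip_address(ip)
    for network, tag in TAGGED_RANGES:
        if addr in network:
            return tag
    return None

def ingress_compliant(src_ip: str, dst_ip: str) -> bool:
    """Ingress to a DataStore must come from a VM; untagged IPs fail
    the stricter rule that every endpoint must carry a service tag."""
    src_tag, dst_tag = tag_of(src_ip), tag_of(dst_ip)
    if src_tag is None or dst_tag is None:
        return False
    if dst_tag == "DataStore":
        return src_tag == "VM"
    return True
```

Writing the rule against tags rather than raw IPs means the baseline keeps working as machines are added to or removed from the ranges.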
Now, one more thing I forgot to discuss:
this telemetry platform we see here will take time to build.
If you're on the cloud, you can leverage existing solutions.
For example, AWS, Azure, and Oracle have flow logs that give you information
about ingress and egress traffic in your cloud.
With that information, what you have to do is just
store the data in, maybe, an S3 bucket or Azure storage, and then pull the
data from there for your processing.
In that way, you can completely get rid of the agent and the
Fluentd that are responsible for sending your data to the data sink.
You can just rely on the cloud providers to send them to your sink,
and from there, you can do your additional processing.
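As a rough sketch of consuming provider flow logs instead of running your own agent, here is a parser for one line in what I believe is the AWS VPC Flow Logs default (version 2) field order; check your provider's documentation for the exact format, and note the sample values are fabricated.

```python
# Assumed default field order for AWS VPC Flow Logs version 2 records.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_log_line(line: str) -> dict:
    """Split one space-delimited flow log line into a named record."""
    record = dict(zip(FIELDS, line.split()))
    for key in ("srcport", "dstport", "packets", "bytes"):
        if record.get(key, "-").isdigit():
            record[key] = int(record[key])
    return record

# Fabricated sample line for illustration.
SAMPLE = ("2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.9.9 "
          "49152 1433 6 10 840 1600000000 1600000060 ACCEPT OK")
```

Records parsed this way can feed the same enrichment and baseline-evaluation pipeline as agent-collected events.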
With that, let's get to extensions.
Now that you have a very good platform,
you can add more scenarios to it.
So one scenario is security recommendations.
Let's say with the telemetry data you have, you are able to
achieve network segmentation,
and what you observe is that some of the service teams are
communicating across these segments,
which you want to drastically reduce.
Now you can provide recommendations based on this traffic, telling the teams
how they can reduce the cross-segment communication so that
the overall security posture improves.
Similarly, because you have the data, you can identify if your network is
reaching capacity and immediately look into capacity provisioning.
For example, let's say there is a bunch of communication between the virtual
machines and your data store, and you see the bandwidth is at 60 to 70
percent of capacity.
That is when you will go and decide that you need more capacity
and add it there.
These are further extensions that you can add to the platform.
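The capacity signal amounts to a simple threshold check over the observed bandwidth. The 60 percent cutoff mirrors the talk's example; the function shape is an assumption, and a real system would look at sustained utilization rather than a single sample.

```python
def needs_provisioning(observed_mbps: float, link_capacity_mbps: float,
                       threshold: float = 0.6) -> bool:
    """Flag a link for capacity provisioning once utilization
    crosses the given threshold fraction."""
    return (observed_mbps / link_capacity_mbps) >= threshold
```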
To conclude, as cloud adoption grows, network telemetry will become
increasingly important for monitoring and protecting organizations.
So it is important that we start investing in such platforms so that,
when the time comes, we are resilient against infrastructure failures and hacks.
With that, I conclude my talk.
If you have any further questions or any feedback,
please feel free to reach out to me.
This is my LinkedIn, email, and Twitter.
Thank you everyone.
Thank you for listening to the talk