Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone, I am
Kalyan Dhokte, and welcome to Conf42 DevSecOps
2022. I am from Cognizant Technology Solutions.
Thanks for joining my session. I am going to talk about the role of
SRE in DevSecOps. Before starting, a quick introduction:
I am a practice lead, responsible for helping customers build
highly reliable, resilient enterprise systems by implementing
modern digital engineering techniques like SRE and
DevSecOps. Let's dive deep into the session.
First we will understand the relationship between DevSecOps and SRE.
Let me take a minute to explain DevSecOps. It is integrating
security as a part of your pipeline. As you are aware, in
DevOps we focus on continuous integration,
continuous delivery and continuous deployment, so
all the phases from design to deployment are automated.
Now DevSecOps integrates security as a part of the entire lifecycle.
The benefit of DevSecOps is detecting security issues much earlier in
the lifecycle. What does DevSecOps have to do with SRE? Let's understand
that. As you know, SRE focuses on the reliability and resiliency
of the system. It also helps to engineer high-performance,
scalable, fault-tolerant applications and helps customers build an always-on
business. Reliability also means that you
are protecting your systems from unauthorized attacks,
preventing vulnerabilities and threats, protecting
customer data, and finally maintaining the privacy of the entire ecosystem.
So for all practical purposes, SRE implements DevSecOps
principles and security practices. Let's understand the
different principles that DevOps and SRE
implement: eliminate
silos, and also eliminate
toil, which is one of the important principles; then
shared responsibility for security, and embedding security
in development teams; accept failure
as normal; gradual changes; automate everything; and
measure everything. It is important to reduce the
cost of failure through incremental changes and
blameless postmortems, while adopting
security best practices and creating a uniform
security posture. Automation is also going to help
improve reliability, and automating security monitoring and testing
is one of the important aspects of DevSecOps.
SRE helps to enforce robust monitoring of
reliability and security throughout, also
making sure SLOs, SLIs and actionable
alerts are in place and get triggered for any such issue.
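To make the SLO, SLI and alert relationship concrete, here is a minimal sketch, not taken from the talk. It assumes a request-based SLI (good requests divided by total requests) and raises an alert once the error budget implied by the SLO target is used up. The names `SloReport` and `evaluate_slo` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    budget_remaining: float  # fraction of the error budget still unspent
    alert: bool              # True once the error budget is exhausted


def evaluate_slo(good: int, total: int, slo_target: float) -> SloReport:
    """Compare a request-based SLI against an SLO target (assumed model)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good / total
    allowed_bad = (1.0 - slo_target) * total  # error budget, in requests
    actual_bad = total - good
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli, budget_remaining, alert=budget_remaining <= 0.0)
```

For example, `evaluate_slo(9995, 10000, 0.999)` leaves roughly half of the budget and no alert, while 20 failed requests in the same window overspend it and set `alert` to true. The same shape could be applied to a security signal, such as the fraction of deployments that pass policy checks.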
Our systems have evolved
beyond the human ability
to mentally model them, and understanding their
behavior is very difficult. So why
are these modern cloud technologies challenging? Modern
microservice-based, cloud native architecture increases speed and
scale along with complexity, with improved cost efficiency,
accelerated innovation, faster time to market and the
ability to scale applications on demand. In this
case, with CI/CD, DevOps and cloud native microservice architecture,
we are building systems that are very large and that change continuously.
We are changing them based on innovations and new features
that are deployed at a very fast pace to
production, and it is very hard to track what
is going on in the system. For example,
Salesforce CRM has hundreds or thousands of microservices,
and tracking all of them is very difficult, which increases
the complexity. As you add
more features, the complexity is also going to increase.
So where is this complexity coming from? We
are creating and deploying new
features with continuous delivery, cloud native serverless
architecture, APIs and microservices, service
mesh, rules engines. All of those are very good for
making sure you are continuously innovating
and continuously deploying new features, but they also increase
the complexity. As this goes on
month after month, by the end of the year
you will find it very difficult to manage and maintain that entire
complexity, because the complexity is also continuously increasing.
As you are upgrading Kubernetes systems and
upgrading cloud services, there are outages you are
continuously fixing, including security-related issues.
If there are certain misconfigurations, you are refactoring
code, moving many applications to
cloud native and serverless architecture, and
changing deployment tools as well. All those factors are going to increase the
complexity. If you want to simplify a complex system, you have to change
it. But in changing systems for simplification,
you will be adding additional complexity. Complexity is
not easy to simplify and
it is very difficult to understand, so dealing with complexity is
more about navigating it. So how will you navigate the complexity?
Let's understand that. Security is a
context-dependent discipline, and by deploying all these
DevSecOps and SRE practices you will be able to navigate
the complexity much more easily. You need the flexibility to
change the system rapidly while applying security context to
system changes. You need to understand that security is
stateful in nature. We can see there
is a clear conundrum: developers constantly
implement new features and move towards customer value rapidly,
and security does not move on the same curve.
So you constantly have to verify that security and reliability
are implemented as per design.
In this case continuous security and reliability
have to be embedded, and you need to continuously check and
verify that security with a security-first
and reliability-first mindset. That's very important. You need
to infuse all those things into development,
delivery and production. SRE
helps in securing the infrastructure and applications when
you deploy into production, and it helps with
tracking availability, SLOs, error budgets and all those aspects,
which includes automated security monitoring as well. So in
this case you need to combine the frameworks of DevSecOps, SRE
and continuous security to address such complex distributed
microservice architectures, because a lot of bad actors
are continuously chasing such complex enterprises with
various attacks, and the race to defend against
attackers must be accelerated. Compared
to functional failures, reliability and security failures
frequently impact entire enterprise systems
and also impact the brand value. Hence special attention is
required for security and reliability. There are some trade
offs as well between reliability and security. While redundancy
is going to increase reliability, it also increases the attack surface.
There are also reliability and security trade-offs with respect to
incident management. When you want to improve
MTTR, reliability incident response benefits
from verbose logging, but those logs can in turn be a
target for attackers. So you need to be very careful to
strike the balance
between security and reliability. Let's understand
the common symptoms of failing DevSecOps
and SRE models. One important symptom is avoidance
and blame. The security team is blamed for
slowing down projects, the entire pipeline is slowed down
by security reviews, and project teams frequently sidestep
security or try to avoid it.
Another is slow and ineffective security reviews: security reviews of new
applications take weeks and yet produce
little actionable information. Another is bad user experience:
the user experience for authentication and authorization is sometimes
very inconsistent. Then there are gaps in the risk
profile as well: large parts of your technology portfolio
have undiscovered security vulnerabilities.
All these things increase the complexity, and the complexity
becomes unmanageable. Your vulnerability list
is very large and complex, and it grows rapidly as
you shift
more applications to production. So what is the
diagnosis? We need to understand why
these symptoms occur. Basically it is misaligned
teams, reinventing the security
and reliability wheels, security and reliability
being involved very late in the game, and many times there
is a lack of automation.
The remedy is to embed a security-first
mindset, and also reliability, in
development teams; create security
and reliability standards and
accelerate them; and automate security
scans, continuous security testing,
observability and resiliency. We need to understand how
to make DevSecOps a
healthy operating model. Embed SREs into dev
teams to help implement security and reliability considerations
early in the design and development phases.
So let's understand the common practices for the success
of the DevSecOps and SRE model. Enhancing DevSecOps operations has
become the SRE's top priority in light of digital transformation and
the continuous escalation of attacks.
These aspects are very important to making DevSecOps
and SRE models successful: embedding security in development, reliability-driven
development, automation and full-stack observability.
That is going to enhance DevSecOps operations
and lead to the success of the DevSecOps
and SRE models. Moving into the
modernized service model of DevSecOps,
you need to embed a reliability and security stack. I
personally believe value stream management plays a big role from
planning to operations along with security, and not just the
CI/CD pipeline, in the digital transformation journey. Basically
value stream management is a business practice that helps to
identify areas of improvement. In
a process, to make operations more efficient you need to
drive the business value. Value stream management
will help with security governance,
automation, security dashboards, analytics, policy
as code and pipelines. From the planning and design phase to the
monitor phase, security will be embedded. Right
from threat modeling, you need to implement
SAST and DAST as well as chaos security experiments,
then security policy compliance. You also need to continuously
schedule audits, fuzz tests, WAF and penetration
testing, and automated security monitoring, right from plan
and design all the way through operations,
pre-production as well as production. That is how security will be implemented
in the production pipeline, and reliability is
equally important, which helps to make the
DevSecOps production pipeline more secure and more reliable.
Combining value stream management with the implementation and expansion
of DevSecOps and SRE practices is becoming the best-of-breed
approach to optimize every stage of deployment,
development and security, embedding security
end to end. By having everything in a single pipeline with optimized
value streams and checks and controls for vulnerabilities and
any kind of misconfiguration, you are going to eliminate
time-consuming manual reviews and help
to accelerate the production pipeline. Here, if you
are shifting left and
moving some applications to the cloud,
there are existing processes and tools, and you need to understand those
existing processes with respect to people: what kind of skills
are there, what kind of culture, what processes are available?
Are they implementing policy as code? What are the
different security practices, and what is the technology?
You can't just start from
scratch. You can learn and document the
existing processes, adapt and optimize them, use those
existing processes, technology and
skill sets, and then implement the DevSecOps pipeline.
There are certain automated and
advanced techniques of DevSecOps and SRE
to automate your entire deployment pipeline.
SREs have also devoted a lot of their
time and attention to shifting right, that is,
automating security monitoring in production and giving continuous
feedback based on security monitoring data back
to developers. So you need to help and
support developers by automating with these advanced
techniques. There are security gates that you can automate
to stop the workflow from slowing down,
then baking monitoring and
analytics into the pipeline, then automating
close verification checks in
the DevSecOps framework, reliability-driven
development, code profiling and tool integration.
All these are very important techniques you can
automate, and there are certain AI and ML techniques for threat
analysis. Advanced DevSecOps frameworks take advantage of AI
and machine learning techniques to streamline, simplify and
speed up complex DevSecOps stacks. There are
two examples I would give. One is collecting and
analyzing software logs while users are logging
into the system; that logging information will
identify which aspects of the software
bad actors are attempting to target. This information
is given to the AI/ML algorithm, which can then suggest
code alterations or any
architecture changes required, to proactively address code vulnerabilities.
From a testing perspective, code changes can be run
through finely tuned machine learning tools to identify how
a particular change might affect other aspects of the
application. These are examples where you can have AI-backed
threat analysis. You also have to make
sure chaos engineering, with respect to security experiments,
is integrated into the automated DevSecOps pipeline.
Tools like Gremlin, ChaoSlingr and CloudStrike can
be used. We'll provide more details in
the case study about ChaoSlingr. On the modernization side,
you need to make sure automated security monitoring and analytics are
implemented: tracking of reliability gates, operational excellence
dashboards, automated auditing and compliance
tools that streamline a lot of compliance-related reporting,
and validation of deployment scripts. There are certain
security context settings and flags you need to add while deploying
any pods in Kubernetes, and a
security policy framework will continuously monitor and
assess those security flags. So overall
these advanced techniques for automation of DevSecOps and
SRE offer a lot of cost savings potential. DevSecOps
and SRE automation, once complete,
will help to lower the likelihood of any catastrophic cybersecurity
incident and reduce the number of
operational staff needed, improve your MTTR
and reduce the number of P1 tickets.
So let's understand the SRE role
in DevSecOps according to Team Topologies.
Effective software teams are essential for any
organization to deliver value continuously and sustainably.
But how do you build the best team for the organization
and your specific goals with respect to culture,
the skill sets you have, the leadership, and
the development and operations teams? There is a
gap between Dev and Ops, and SRE
is going to help bridge that gap between the operations
and development teams. SRE will help with
reliability and resiliency features
that have to be prioritized, and it
will help
automate reliability engineering, security observability and
constant security feedback, and implement the golden signals.
So SRE is going to help bridge the gap between the operations
and development teams. The role of SRE is to collaborate, engage in
value-added activities and create results that contribute to measurable
reliability improvements. Let's understand
the different tasks with respect to the SRE
role in DevSecOps. What are the backlogs? The initial adjustment
to the DevSecOps model requires a change in
mindset for developers and SREs, but building in
vulnerability detection ahead of app deployment ultimately
lowers their stress around security.
There are SRE tasks in different phases for improving reliability,
resiliency and security. With
respect to security, you need to secure the build pipeline,
secure the entire deployment and also provide runtime protection.
The security side of it is mainly enabling
security code scanning, enforcing security policy whenever the pods
are running in an AKS Kubernetes cluster, and configuring and validating
deployment scripts for security context. Overall reliability
and dependability consist of seven attributes, and security
is a combination of confidentiality, integrity and availability.
Confidentiality is mainly ensuring that
information is inaccessible to unauthorized people,
commonly enforced through encryption, IDs and passwords,
and even two-factor authentication. Integrity is
mainly safeguarding information and systems from being
modified by unauthorized people. And of
course availability is ensuring that authorized people have access to the
information when needed.
Safety, resiliency and maintainability
also contribute to the overall dependability of the system.
So these are some of the tasks and backlogs you
need to continuously implement during every phase,
right from build through deployment and operations, with respect
to the specific areas of reliability, resiliency and security.
Next is the plan-and-develop phase,
with threat modeling
across all the phases. This is a case
study of building security and reliability
into the DevSecOps pipeline. Right from plan and develop, you
start with threat modeling. There are certain threat modeling tools
you can implement, and these modeling tools will help to
identify the impact on the various
design components. IDE security plugins start immediate
code scanning, along with pre-commit hooks and secure coding standards.
Then comes commit the code. When you commit code into
any repository, you need to understand what a secure repository is
and whether you are implementing one. Then there is static
code analysis, then
security unit and functional tests, and
dynamic security testing as well, in the build and
test phases, plus dependency management and securing the entire
pipeline. Infrastructure scanning and
cloud configuration validation are also important.
All those processes matter right from plan and develop to the operate
phase. This is the ideal security model.
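As one illustration of the pre-commit scanning step in this model, here is a minimal sketch of a secret scanner that a pre-commit hook might call. It is an assumption for illustration, not the tooling used in the case study. The three patterns and the names `SECRET_PATTERNS` and `scan_text` are hypothetical, and real scanners ship far larger rule sets.

```python
import re

# Illustrative patterns only (assumed, not from the talk).
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
    "hardcoded_password": re.compile(
        r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"
    ),
}


def scan_text(text: str) -> list:
    """Return (line_number, rule_name) pairs for every suspicious line."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A hook would run `scan_text` over each staged file and exit with a non-zero status when the list is not empty, so the commit is blocked before the secret reaches the repository.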
Apart from that, in build and test there are certain plugins
created for this case study, basically a YAML
validator and a security context schema.
So you will have certain plugins.
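As a rough idea of what such a validator plugin might check, the sketch below inspects a pod spec that has already been parsed from YAML into a Python dict. The parsing step and the pipeline wiring are left out, and the function name `check_pod_spec` is hypothetical. The field names `runAsNonRoot`, `readOnlyRootFilesystem` and `allowPrivilegeEscalation` are standard Kubernetes `securityContext` fields; which values to require is an assumed policy.

```python
# Assumed policy: every container must carry these securityContext values.
REQUIRED_FLAGS = {
    "runAsNonRoot": True,
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
}


def check_pod_spec(pod_spec: dict) -> list:
    """Return human-readable problems for containers missing required flags."""
    problems = []
    pod_ctx = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        for flag, expected in REQUIRED_FLAGS.items():
            value = ctx.get(flag)
            # runAsNonRoot may be inherited from the pod-level securityContext.
            if value is None and flag == "runAsNonRoot":
                value = pod_ctx.get(flag)
            if value is not expected:
                name = container.get("name", "?")
                problems.append(f"{name}: {flag} should be {str(expected).lower()}")
    return problems
```

A pipeline step could fail the build when the returned list is not empty and attach the messages to the developer-facing report.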
In this case it is an Azure pipeline. We have to make
sure that while developers are writing the
YAML scripts, they add the relevant security context
and security flags: for example, run
as a non-root user, no write
access to the root file system (a read-only root file system), and allow privilege
escalation set to false. All those flags need to be
checked and validated while you are executing the
Azure pipeline. Continuously
checking that schema and providing the report directly
to the developers can be automated as
well. Then, before going to production,
when you are moving your code into the non-production environment
and also the production environment, chaos engineering and security
experiments can be conducted using
security experiment tools like ChaoSlingr and Gremlin,
which will help you understand the
complexity. You need to navigate the complexity, and when
any change or turbulent condition arises
you need to be prepared for it. This gives confidence
to the developers and operations team.
Security vulnerability is one of the
areas you can simulate with fault
injection, and the same applies
to the reliability side.
You find issues while they are relatively easy
and cost effective to resolve;
as a result it greatly reduces the
total cost of deployment and development, because
you are identifying those security issues early in
the lifecycle. We also need to
design tools with non-security experts
in mind, making threat modeling easier for developers
by providing clear guidance on creating and analyzing the threat model.
Now let's understand how some of these security
policies can be enforced. A security center like
Defender continuously discovers new resources that
are being deployed across your workloads and assets, and assesses
whether they are configured according
to security best practices or not. Then
tickets are flagged and you get a prioritized list of
recommendations. This is enforcing
security policy using the security center. It will
help you know your posture, continuously assess
and identify vulnerabilities, harden resources and services against a
security benchmark, and detect and resolve threats
to resources and services.
The major motivation for chaos engineering with
security experimentation is to gain confidence when systems are exposed
to real-life attack scenarios, and that
can be translated to any cloud platform with
various APIs and various attack surfaces.
There are open source security experimentation tools like
ChaoSlingr, and then CloudStrike. SRE engineers
design security test scenarios by
performing a risk-based fault and vulnerability assessment and then
creating the various test scenarios. There
are certain components. One is the controller, which coordinates the
chaos injection experiments; then the manager, which receives instructions
to conduct attacks based on specified attack modes.
The fault engine holds
knowledge about cloud compliance and best practices.
The fault injector is responsible for executing the security faults
against the target cloud assets. The chaos monitor
is an important component which
continuously monitors the progress of attacks and detects early
any effects of fault injection, which helps control
the blast radius. The chaos analyzer analyzes the
scores and generates the report. Possible recommendations
include updating security rules for
security groups; if you are creating any NSG rules,
they can be added or enhanced based
on what you learn from the simulation; other recommendations cover
restriction of access as well as making
sure access control policies are intact. All these
important recommendations can be derived from
chaos engineering experiments. So various
test categories exist: spoofing of user
identity and other entities; tampering with S3,
RDS or Redshift data stores; privacy breaches or
data leaks, such as misconfigured default security groups;
denial of service, such as destroying cloud service configurations and data
stores; and elevation of privilege,
where you add some users, assets or accounts to existing
roles or groups with higher privileges and then check
whether you are getting the security alerts or not. Checking the security
alerts is an important aspect of validating all
those scenarios. For validating baseline security requirements, assign a
public IP to your components to compromise
internal resources, and then check whether the security center
is detecting it and providing alerts or not.
There are also scenarios with respect to
ChaoSlingr and CloudStrike. One scenario is
a misconfigured port change, and another is S3 bucket permission changes. You have a
list of NSGs, and you need to select those NSGs
that are tagged with the opt-in tag for chaos.
Randomly select any NSG and apply a random open
or close action based on the port configuration. ChaoSlingr
will help to apply those configuration changes. There is a
generator component which starts the experiment and performs the port
change, and then you need to track
the changes and verify whether events are triggered in security
center alerts or not. That is one of the scenarios you
can test. Then there are S3 bucket permission changes.
You create a new user and get a list of all
the buckets in the cloud, whether AWS, Azure or
GCP. Select a random bucket from the set of buckets
in the cloud and configure attack points using these chaos
tools. Then simulate bucket unavailability
by changing the bucket ACL from allow to deny. The tool
applies the configuration changes and starts the
security experiment, and you get real-time insights from the
chaos engineering tool set for the experiment.
Then you can monitor AWS CloudWatch and verify whether
events are triggered in the security center or not.
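The port-change scenario can be sketched as an in-memory simulation. This is not how ChaoSlingr or CloudStrike are implemented and it makes no cloud API calls. `SecurityGroup`, `FakeSecurityCenter`, `run_port_change_experiment` and the `chaos-opt-in` tag name are all hypothetical stand-ins for the opted-in groups, the security center and the experiment runner.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SecurityGroup:
    name: str
    open_ports: set
    tags: dict = field(default_factory=dict)


class FakeSecurityCenter:
    """Stand-in for a security center; records an alert per observed change."""

    def __init__(self, detection_enabled: bool = True) -> None:
        self.detection_enabled = detection_enabled
        self.alerts = []

    def observe(self, group_name: str, port: int, action: str) -> None:
        if self.detection_enabled:
            self.alerts.append((group_name, port, action))


def run_port_change_experiment(groups, center, candidate_ports, rng):
    """Toggle one port on one opted-in group and report whether it was alerted."""
    opted_in = [g for g in groups if g.tags.get("chaos-opt-in") == "true"]
    if not opted_in:
        return None  # nothing has opted in, so nothing is touched
    target = rng.choice(opted_in)
    port = rng.choice(candidate_ports)
    if port in target.open_ports:
        target.open_ports.discard(port)
        action = "close"
    else:
        target.open_ports.add(port)
        action = "open"
    # In a real platform the cloud would emit this change event itself.
    center.observe(target.name, port, action)
    alert_seen = (target.name, port, action) in center.alerts
    return {"target": target.name, "port": port,
            "action": action, "alert_seen": alert_seen}
```

A result with `alert_seen` set to false is the interesting outcome: the change was applied but nothing was raised, which points at a detection gap. Restricting the run to opted-in groups is one simple way to limit the blast radius.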
So there are two things. One is testing,
with penetration testing and so on. It is also very important
to make sure you are creating
automated dashboards and continuously monitoring
security and reliability. Apart
from that, you have to create customized SLO
monitoring dashboards and use automated root cause analysis.
Without automation it is very difficult, because
a lot of data is being generated and it is hard to
pinpoint any component and issue in the production system.
Automated root cause analysis uses
anomaly detection and suspect ranking. Automated machine
learning is applied using Python,
SciPy, scikit-learn and so on. These are the automated dashboards that you
can create to understand where the anomaly is, identify the
anomaly and detect it very rapidly.
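Anomaly detection with suspect ranking can be illustrated with a much simpler technique than the machine learning libraries named above. The sketch below swaps in a plain z-score from the Python standard library: each component's latest metric value is scored against its own history, and components above a threshold are ranked from most to least anomalous. The function names and the default threshold of 3.0 are assumptions.

```python
from statistics import mean, pstdev


def anomaly_score(history, latest):
    """Z-score of the latest value against the component's own history."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any deviation at all is maximally surprising.
        return 0.0 if latest == mu else float("inf")
    return abs(latest - mu) / sigma


def rank_suspects(metrics, threshold=3.0):
    """metrics maps component name -> (history, latest); returns ranked suspects."""
    scored = [
        (name, anomaly_score(history, latest))
        for name, (history, latest) in metrics.items()
    ]
    suspects = [(name, score) for name, score in scored if score >= threshold]
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

The ranked list is the kind of output a root cause dashboard could surface first, so an engineer starts with the most anomalous component instead of scanning every metric.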
The summary here is the
cultural aspect of SRE: embed security
along with reliability; implement SRE
practices in DevSecOps, like security chaos engineering; improve organizational
defense with a zero trust network; and make sure
reliability and security are integrated into the design and deployment
process. You have to eliminate bottlenecks related to reliability
and security in continuous delivery in the production
pipeline, bridging the gaps with security practices while
ensuring quick and safe deliverables.
Eliminate siloed teams through increased collaboration,
shared security responsibility and reliability-driven development.
These are the takeaways from this session.
I hope you enjoyed this session, and I will leave you with
two quotes. "SRE work is like being a part of the world's most
intense pit crew. We change the tires
of a race car as it's going 100 mph."
That's a quote from Andrew, an
SRE at Google. And
another quote is that you need to continuously evolve and
change: "What's dangerous is not to evolve." That's a quote from
Jeff Bezos. It's very important to understand
and continuously evolve with the new changes.
So thank you very much. I hope you enjoyed this session.
If you have any questions, please reach out to me on LinkedIn,
Twitter and Klokta at
9987 on Discord. Thank you very much.