Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hi everyone.
A recent CatPoint survey found that 66 percent of organizations use
two to five monitoring or observability tools, and that 24 percent of
organizations have breached a contractual service level agreement
obligation during the last 12 months.
Those are two very important numbers.
Having more than two or three observability tools is problematic,
because you will struggle to arrive at a unified, single source of truth.
And the 24 percent figure about contractual failures is alarming as well.
What it shows is that even with multiple tools, organizations are struggling.
My name is Indika Wimalasuriya.
Welcome to Conf42 Platform Engineering 2024.
My presentation is on AWS observability as code,
and I'll focus on leveraging Datadog to enable advanced platform engineering.
As part of my presentation, I will give you a quick overview of the
challenges modern observability platforms face due to their distributed nature.
We will quickly discuss the role of observability in platform engineering,
and we will touch on Datadog, which I will be using to enable observability
as code in AWS-based infrastructures.
Then we will look at how we can implement observability-as-code solutions,
with real-world implementation details,
and we will wrap up with some best practices and
pitfalls you may need to avoid.
So, a quick intro about myself.
My name is Indika Wimalasuriya.
I'm based in Colombo, where I live with my wife and daughter.
I come from a strong reliability engineering and
solution architecture background.
My specialties are site reliability engineering, observability,
AIOps, and generative AI.
I'm currently employed at Virtusa, where I oversee technical
delivery and capability development.
I'm a passionate technical trainer and an energetic technical blogger.
I'm an AWS Community Builder in Cloud Operations, and an
Ambassador at the DevOps Institute.
So, to start: distributed systems and modern systems are evolving fast,
and with that, the complexity we need to manage is also increasing.
Traditional monolithic systems have already moved into microservices.
From the data centers we used to host our applications in, we have moved
into the cloud, and in some instances from the cloud into serverless
architectures as well.
What this has done is expand the data sources, surge the data volumes,
and exponentially increase the failure scenarios.
So while this adds a lot of value to modern systems, it brings a certain
level of increased failure scenarios as well, which has made these
environments difficult to manage and has increased the role of observability.
Traditional monitoring, a mainly rule-based, predefined approach to
looking at systems, is no longer enough.
We have moved into observability, mainly due to the complexity,
dynamic nature, and elasticity of distributed systems.
Most of our systems are now container based, we have CI/CD deployments
and integrations, we are connecting to API gateways, and there is
increased use of cloud service providers.
Some of our systems are poly-cloud or multi-cloud, leveraging
multi-vendor solutions.
Our systems have scalability challenges, which we sometimes use these
cloud providers' solutions to rectify, and this requires observability
into scalability triggers.
We want to detect things faster, resolve things faster, and have
end-to-end visibility of our user journeys.
These are some of the key things driving the need for observability.
So, to quickly touch on observability: logs tell you what your code is
doing, and we typically use logs for root-cause understanding.
Then we have metrics.
Metrics are simply numbers, a point-in-time measure of your
application's performance.
Metrics are generally used for alerting, because a number gives us
a good way to generate alerts.
And we have tracing.
Tracing shows what your code is doing at a point in time.
You can see what the code is doing at the method level, the time taken,
and you can even trace a request from the front end through the
middleware to the backend database calls.
So tracing is very important: it allows us to understand the time taken
in our code and to see what our code is doing.
On top of that, we build alarms, which are our monitors.
We develop a lot of dashboards.
We create synthetic monitors, and we use real user monitoring, which
tries to understand what your actual end users are doing.
We look at the infrastructure, the network, and the security aspects of
our ecosystem, and of course we look at how we can optimize cost: the
AWS or other cloud provider infrastructure cost, and other costs as well.
These are pretty much the key fundamentals you want to achieve
with observability.
When it comes to platform engineering, let me take one example.
Say you have an organization with around three departments or lines of
business, and each department has multiple applications or
engagements running.
What can happen is that these engagement or application teams each find
their own preferred observability tools: alerting tools, log
management tools, security tooling.
They leverage whatever they prefer.
Over a period of time, we lose unification, the unified approach.
It becomes very difficult to enable consistency, very difficult to
build consistency across these systems, and difficult whenever new
applications and new services come in.
We run into trouble and challenges.
So, as part of platform engineering, what we are trying to do in this
specific area is build a unified observability strategy: make
observability consistent across teams and ensure it is centralized.
We want to build an efficient troubleshooting and incident response
framework, leveraging cross-team collaboration to drive fast and
efficient root cause analysis.
We also want to ensure scalability and reliability, leveraging
proactive monitoring, predictive capabilities, and other advanced
features such as capacity planning.
And we want to enhance the developer experience, because developers are
important and they need end-to-end observability into our
production systems.
This means we need to build self-service observability solutions, with
access and dashboards specific to developers as well.
And we will enable services related to security and compliance, cost
efficiency and cost tracking, and so on.
Those are the key aspects of observability in platform engineering, and
they will let us get the true value of platform engineering in the
observability area.
Let me talk about Datadog, because here we are using Datadog on top of
AWS and looking at what observability and observability-as-code
capabilities we can enable.
If you look at it, we have multiple data sources; Datadog supports
around 750-plus source integrations.
It supports logs, traces, metrics, and metadata sessions as well.
It also provides a lot of dashboards, alerts, collaboration, mobile
access, monitoring workflows, Watchdog, which is its AI platform, and
integration with OpenTelemetry.
Some of the observability use cases Datadog provides are infrastructure
monitoring, application performance monitoring, unified universal
monitoring, log management, continuous integration visibility,
continuous profiling, real user monitoring, network monitoring,
synthetic monitoring, cloud security, application security,
observability pipelines, and many more.
All of this allows the key partners and teams, the development teams,
IT operations teams, security teams, support teams, and the business,
to have a full view of our ecosystem.
When you are building an AWS-based, observability-driven platform
engineering solution, there are a few things we need to focus on.
One is centralized Datadog integration.
For standard workloads, for example EC2s, we may have to install the
agents; if it is containers, we'll have to do the container
integration; and for serverless, like AWS Lambdas, we might have to
instrument the Lambdas with the libraries and then enable the AWS
integration, especially for APM.
With this, what we are trying to do is enable the key elements of
observability: the standard metrics and dashboards, and logs, which we
want to keep consistent, and also the traces and spans, so that we can
troubleshoot and identify what's happening in microservices hosted in
AWS, in containers, or even in AWS Lambda.
On top of that, Datadog allows us to develop alerting and
incident workflows.
There are also other services, such as performance and scalability,
resource management, cost management, and CI/CD integrations;
automated setup; governance, including how we build access controls
and enable compliance controls; and other developer-related services
such as developer access, documentation, and training.
Those will allow us to develop a comprehensive observability platform
engineering framework for AWS.
So, moving on.
When you look at observability as code, there are a few key elements.
One is treating your observability configuration as code: the agent
integrations, or the integration between Datadog and AWS; the metrics,
logs, and traces, and how you set them up in an AWS-based ecosystem,
leveraging Datadog to achieve it.
The next is versioning and management.
We want version control, reusing tools like Git, which allows us to
track our changes and increase collaboration.
Of course, we will have to wire these observability-as-code solutions
into CI/CD deployments and automation, so that while the development
teams are developing these solutions, they can push the
observability-as-code changes through multiple pre-live environments,
and from there deploy onward until we reach production.
This way we can enable consistency and standardization: implement
standardized templates, set up quality gates, build modules for
observability configuration, and apply consistent settings across
multiple environments.
This enables comprehensive observability monitoring and maintenance:
conducting reviews and validations of our configuration, and
implementing test frameworks to ensure we have proper quality gates as
we move from pre-live to the production environment.
We can also standardize documentation, which enables a lot of
collaboration across multiple teams.
The whole idea is that instead of configuring things by hand in
Datadog, we develop our solutions as observability as code.
That code can be promoted from one environment to the next until it
reaches production, different teams can contribute to it, we can track
versions, we can enable collaboration, and we get better quality as well.
Moving on to some of the benefits of an observability-as-code solution
with AWS.
It improves accuracy: there is enhanced visibility because you are
working in code, and that visibility drives a lot of accuracy.
Because it's code, you can run a lot of validations and enable a lot
of collaboration.
While we enable that collaboration, we can build quality gates on top
of our code, and it's much faster, so we get a much faster
turnaround time.
It is also consistent across environments, which addresses one of our
common challenges: you may have some degree of observability in your
production environment, but it might not be identical in your lower
environments.
When that happens, developers face challenges; there can be
misunderstandings and misinterpretations, and difficulty achieving
what we're trying to achieve.
And obviously, observability as code promotes better collaboration and
allows us to make continuous improvements to our systems.
So, moving on.
With AWS there are a lot of infrastructure-as-code tools we can use for
this, things like Terraform, Ansible, Pulumi, or AWS CloudFormation.
In this session we are going to leverage Terraform, which is widely
used across the board.
What we are trying to achieve is to write the entire
observability-as-code solution in Terraform and push it into Datadog.
We want to create things like the Datadog integration itself: we can
completely automate the AWS-Datadog integration using Terraform.
We also want to ship AWS-based logs into Datadog, whether from your EC2
workloads, your container-based workloads, or your serverless framework.
All of those we can ship to Datadog, and we can streamline it using
Terraform, enabling APM in your containers, your EC2-based workloads,
or even your serverless-based frameworks.
We can leverage observability-as-code solutions for all of it.
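As a rough sketch of what that integration automation can look like, the Datadog Terraform provider offers an AWS integration resource based on IAM role delegation. The account ID and role name below are placeholders, and newer provider versions also offer a successor resource for the same purpose:

```hcl
# Sketch: automating the AWS-Datadog integration with the Datadog
# Terraform provider. Account ID and role name are placeholders.
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
  }
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}

# Role-delegation based integration: Datadog assumes this IAM role
# to pull CloudWatch metrics from the account.
resource "datadog_integration_aws" "main" {
  account_id = "123456789012"               # placeholder AWS account ID
  role_name  = "DatadogAWSIntegrationRole"  # IAM role Datadog assumes

  # Optionally limit which metric namespaces are collected.
  account_specific_namespace_rules = {
    auto_scaling = false
  }
}
```

The IAM role itself (trust policy towards Datadog's account, plus the required read permissions) would be created alongside this with the AWS provider.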
Then we want to enable common metrics and develop common dashboards.
This is very important, because developers, DevOps teams, and SREs will
generally refer to some of these standard dashboards.
And what we can do is make these standard dashboards available not just
in production, but in your pre-live environments as well: for example,
the development environment, the test environment, the integration
environments, and the staging environment.
This way you have unification, and developers are able to see how
things are behaving in the pre-live environments before they see how
their code behaves in production.
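One way to sketch such a shared dashboard is to define it once in Terraform and parameterize the environment, so the same definition is applied to every pre-live environment and production. The metric queries and widget choices below are illustrative:

```hcl
# Sketch: one dashboard definition reused per environment via a
# variable. Metric names are illustrative placeholders.
variable "environment" {
  type    = string
  default = "staging"
}

resource "datadog_dashboard" "service_overview" {
  title       = "Service Overview (${var.environment})"
  layout_type = "ordered"

  widget {
    timeseries_definition {
      title = "p95 request latency"
      request {
        q            = "p95:trace.http.request.duration{env:${var.environment}}"
        display_type = "line"
      }
    }
  }

  widget {
    timeseries_definition {
      title = "Error count"
      request {
        q            = "sum:trace.http.request.errors{env:${var.environment}}.as_count()"
        display_type = "bars"
      }
    }
  }
}
```

Applying the same module with `environment = "production"`, `"staging"`, and so on gives developers the identical view across environments.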
Then we can automate the creation of synthetic monitors.
This is very important: it is about mimicking the production end user,
what actual end users are actually doing, so that we have more control
in seeing how our application is behaving, and of course we can
generate a lot of alerts on top of these synthetic monitors, on
metrics, or on some of these log events as well.
We can automate the entire account creation and access control as
observability as code.
We can enable the CI/CD pipeline integrations, and in some instances
the distributed tracing setup as well.
One of the important aspects when it comes to site reliability
engineering is defining your service level indicators and your service
level objectives.
Datadog, as an observability tool, provides very good capabilities for
defining service level objectives.
You can create them as good events divided by total events, create SLOs
based on your synthetic monitors to track uptime and similar things,
or create SLOs based on different time slices as well.
All of these options Datadog provides can be automated: we can get our
developers to write SLOs using observability-as-code solutions, and
those can be promoted through multiple pre-production environments
until they move into production.
That way we can enable consistency in the service level objectives from
the pre-live environments through to the production environment.
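A minimal sketch of the good-events-over-total-events style of SLO, written with the Datadog Terraform provider; the metric names and targets here are placeholders for your own success and total counters:

```hcl
# Sketch: a metric-based SLO defined as good events / total events.
# Metric queries and targets are illustrative placeholders.
resource "datadog_service_level_objective" "checkout_availability" {
  name = "Checkout availability"
  type = "metric"

  query {
    numerator   = "sum:checkout.requests.success{env:production}.as_count()"
    denominator = "sum:checkout.requests.total{env:production}.as_count()"
  }

  thresholds {
    timeframe = "30d"   # rolling 30-day window
    target    = 99.9
    warning   = 99.95
  }

  tags = ["team:payments", "managed-by:terraform"]
}
```

Swapping the `env:` tag via a variable lets the same SLO definition be promoted through pre-live environments before production.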
We are obviously also able to enable automated incident management
workflows using our observability-as-code solutions, and finally our
compliance and governance controls as well.
So what we will do now is look at some Terraform code snippets, to give
you an idea of how these integrations work.
If I take a step back: to define and develop an observability-as-code
solution leveraging Datadog for AWS, step one is to define our
observability configuration.
Generally we identify and create configuration files covering the key
observability components, such as metrics, logs, traces, dashboards,
and events, most of which I covered on the previous slide.
Then we look at version control: we keep our observability-as-code in
Git, manage versioning, enable collaboration and branching, and follow
merging strategies, so that we have more comprehensive control of our
observability.
We can also build a lot of test solutions: unit testing and other forms
of validation, such as syntax checks.
The fourth step is to enable this through continuous integration:
integrate with the CI pipelines and automatically run tests, to ensure
these changes go through multiple environments before we move
into production.
We can have multiple quality gates, and as part of CD we can do the
automated deployments as well.
Then we have to build the monitoring and validation: monitor the
automated deployments moving across multiple environments, and run a
validation process.
We can involve the multiple teams who own these pre-live environments,
the developers, the testers, the end-to-end and integration teams, and
even the DevOps and SRE teams, so they can review the outputs of these
observability-as-code solutions across the pre-live environments for
continuous review and validation.
This will typically support your audit and compliance needs as well.
Because we are focusing on Terraform, let's look at a typical Terraform
folder structure.
You will have your observability-as-code folder, and under it
the deployments.
You will have your main.tf and your providers file, which enables you
to do the provider configuration with Datadog.
You can define your variables, and you can pin your Terraform
version as well.
Then you can have a set of files for creating monitors, service level
objectives, dashboards, and maintenance windows, or for the other
integrations and forms of automation you want to enable.
Typically, to integrate with Datadog, we either run the Datadog agent
in your workloads or do a direct integration with your
serverless framework.
Here we can use Terraform to do this configuration.
You can see we are using the Datadog API and app keys.
You don't have to hard-code them: we can use AWS Secrets Manager and
read these keys from it, so that we have a better mechanism for
handling secret keys as well.
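That Secrets Manager pattern can be sketched like this; the secret name and the JSON keys inside it are assumptions for illustration:

```hcl
# Sketch: reading Datadog credentials from AWS Secrets Manager instead
# of hard-coding them. Secret name and JSON field names are assumptions.
data "aws_secretsmanager_secret_version" "datadog" {
  secret_id = "datadog/keys"   # hypothetical secret holding both keys as JSON
}

locals {
  # Expecting the secret string to look like:
  # {"api_key": "...", "app_key": "..."}
  datadog_keys = jsondecode(data.aws_secretsmanager_secret_version.datadog.secret_string)
}

provider "datadog" {
  api_key = local.datadog_keys["api_key"]
  app_key = local.datadog_keys["app_key"]
}
```

This keeps the keys out of state-adjacent variable files and lets IAM control who can read them.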
We are also able to take our logs in AWS, which typically reside in
CloudWatch, and ship them into Datadog.
While doing that, we can apply a lot of regular expressions: we can use
them to split our logs, keep the important ones, and filter logs based
on log type, whether it's an error, a warning, and so on, so we can
identify and extract some of these different log exceptions.
We have to be mindful that Datadog is a paid solution, and it can
sometimes be a little expensive when you try to ship the entire log
stream into Datadog.
So what you can do is use a lot of filtering; that way you can save
some of the cost when it comes to using Datadog for log management.
Moving on, one of the other things I personally like is developing a
lot of synthetic monitors.
We can develop synthetic monitors to test different pages or user
journeys, and we can configure failure handling and retries.
Again, these are very important: the synthetic monitors will actually
try to mimic the end user in our production systems, so they have a
lot of value, and again we can use Terraform to get this done.
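A sketch of such a synthetic test as code, here a simple HTTP uptime check with retries; the URL, locations, frequency, and notification handle are all illustrative:

```hcl
# Sketch: an HTTP synthetic test with retries, mimicking an end-user
# request against a key journey. URL, locations and frequency are
# illustrative.
resource "datadog_synthetics_test" "login_page" {
  name      = "Login page availability"
  type      = "api"
  subtype   = "http"
  status    = "live"
  message   = "Login page check failed. Notify: @slack-oncall"
  locations = ["aws:eu-west-1", "aws:us-east-1"]
  tags      = ["journey:login", "managed-by:terraform"]

  request_definition {
    method = "GET"
    url    = "https://app.example.com/login"   # placeholder journey URL
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300   # run every 5 minutes
    retry {
      count    = 2     # retry before alerting, to absorb transient blips
      interval = 300
    }
  }
}
```

Browser-type tests covering multi-step user journeys follow the same resource, with recorded steps instead of a single request.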
One of my personal favorites is creating alerts.
You have to create alerts based on, first, your service level
objectives: alerts based on your caching layer; alerts based on your
middleware, the logic layer, the service layer; and alerts based on
your other end-to-end journeys as well.
So what you have to do is keep your alerts more or less correlated with
your service level objectives, and the service level objectives
correlate with the end-user experience.
You can of course also have a lot of triggers related to infrastructure
monitoring, network monitoring, and other anomaly monitoring.
Alerts can be standard, metric-driven alerts, or you can enable
AI-capability alerts such as anomaly detection or forecasting.
For standard alerts you define the thresholds; in the case of anomaly
detection and forecasting, you can let Datadog handle the thresholds,
the dynamic thresholds based on the baseline.
You can configure which teams these alerts go to, and you can apply a
lot of tagging.
Using observability as code in AWS in this area gives you a lot of
options and a lot of standardization, and that adds a lot of value in
a large organization managing multiple teams.
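Both styles can be sketched as Terraform monitors; the queries, thresholds, and notification handles below are illustrative, and the second monitor shows the anomaly form, where Datadog learns the baseline:

```hcl
# Sketch: two monitors as code. Queries, thresholds and handles are
# illustrative placeholders.

# 1) A standard metric-driven alert tied to a latency SLI.
resource "datadog_monitor" "latency" {
  name    = "High p95 latency - service layer"
  type    = "metric alert"
  message = "p95 latency breached. @slack-sre-alerts"
  query   = "avg(last_5m):p95:trace.http.request.duration{env:production} > 0.5"

  monitor_thresholds {
    critical = 0.5
    warning  = 0.3
  }

  tags = ["layer:service", "managed-by:terraform"]
}

# 2) An anomaly monitor: Datadog derives dynamic thresholds from the
#    learned baseline instead of a fixed number.
resource "datadog_monitor" "anomalous_traffic" {
  name    = "Anomalous request rate"
  type    = "query alert"
  message = "Request rate outside learned baseline. @slack-sre-alerts"
  query   = "avg(last_4h):anomalies(sum:trace.http.request.hits{env:production}.as_count(), 'agile', 2) >= 1"

  monitor_thresholds {
    critical = 1
  }

  monitor_threshold_windows {
    trigger_window  = "last_15m"
    recovery_window = "last_15m"
  }

  tags = ["layer:frontend", "managed-by:terraform"]
}
```

Routing (`@slack-...`, `@pagerduty-...` handles in `message`) and the `tags` list are where the per-team configuration and tagging standards mentioned above are applied.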
Most teams use some combination of workloads: EC2s, containers, or
serverless-based solutions like Lambdas.
And pretty much, what you monitor is often the same: your caching
metrics, your service layer metrics, your backend-related metrics,
your frontend-related metrics if you enable real user monitoring.
So there can be a lot of common metrics, and on top of that we can
adopt a lot of common dashboards as well, which can be aligned or even
catered to multiple different end-user journeys.
Because the technology is the same, and what's happening in the
background is the same, we can create some very standard dashboards.
These dashboards, as an observability-as-code solution, can be moved
across different environments in AWS: your pre-live environments,
staging, and production.
That way you have a lot of consistency, and you allow your developers
to go and look at how the application is behaving in multiple
environments, not only in production.
So this is a very valuable thing, and with it we can bring in a lot
of standardization.
We can allow other engagements and other teams to use the same
capabilities as well.
So whenever new applications or new engagements are onboarded, things
will be very straightforward: they can simply plug into the Git
repository, look at these Terraform scripts, the observability-as-code
solutions, and integrate them in their AWS environment.
Moving on, another thing is enabling APM, application
performance monitoring.
Here again, whether it's workload based, using OpenTelemetry, or
serverless based, integrating APM with your serverless frameworks, we
can leverage observability-as-code solutions in AWS as well.
And finally, some typical platform engineering use cases are account
creation and access control, and we can use observability as code here
in AWS too: we can have different access layers for your dashboards,
your monitors, your logs, and your synthetics, the ability to create
admin users, read/write privileges, account creation, cost views, and
all of those things.
We can leverage Terraform, meaning your observability-as-code solutions
with AWS, leveraging Datadog, to achieve these capabilities.
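A sketch of that access-control-as-code idea: a read-only developer role plus a user assigned to it, using the Datadog provider's permissions data source. The permission names and the email are assumptions for illustration:

```hcl
# Sketch: access control as code - a read-only developer role plus a
# user assigned to it. Permission names and email are placeholders.
data "datadog_permissions" "all" {}

resource "datadog_role" "developer_read" {
  name = "Developer (read-only)"

  permission {
    id = data.datadog_permissions.all.permissions["dashboards_read"]
  }
  permission {
    id = data.datadog_permissions.all.permissions["monitors_read"]
  }
}

resource "datadog_user" "dev" {
  email = "dev@example.com"   # placeholder
  roles = [datadog_role.developer_read.id]
}
```

Defining roles this way means onboarding a new team member is a reviewed pull request rather than a manual console change.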
And finally, of course, we can use this to streamline our CI/CD
pipeline integrations using Terraform.
This is very beneficial, because we want to ensure the
observability-as-code solution goes through multiple environments so
that we can capture its benefits.
And in case a specific customization is needed for your distributed
tracing setup, we can leverage this as well.
As I said, one of the other important things is defining service
level objectives.
We have seen there is a common framework, common aspects, when defining
them, because a typical organization uses some combination of
AWS-based workloads, containers, and serverless frameworks, and that
lets us leverage standardization and the common factors.
Using SLOs as code, what we want is that instead of defining SLOs last,
or developing SLOs once the application goes live, our developer
community thinks about service level objectives while they are
designing, developing, testing, and deploying their code across
multiple environments, so they can write the SLOs as part of the code.
Then they ship them and let them go through multiple environments until
production, and even after production, so that we can monitor our
service level objectives.
For example, in one of our pre-live environments we can do performance
testing, soak testing, security testing, and other forms of stress
testing, and see how our SLOs behave.
We can mimic high peak traffic, high concurrent traffic, simulate
spikes or floods of traffic and different types of user behavior, and
see how our SLOs are behaving.
So this is a very powerful feature.
Typically, what happened was that people defined SLOs in their Datadog
ecosystems at a late stage, for production only.
This way, we can do it from the very beginning, in multiple
environments, so there is a lot of visibility and consistency, and we
can achieve a lot of good things by doing this.
And obviously, whenever events are created, Datadog lets us invoke
webhooks, and from there we can invoke remediation engines running in
our AWS workloads or in serverless frameworks like Lambdas.
With that, we can build a lot of self-healing solutions and a lot of
automation solutions.
Observability as code provides those capabilities as well.
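The webhook half of that loop can itself be defined as code; the endpoint URL below is a placeholder (for example, an API Gateway in front of a remediation Lambda), and the payload fields use Datadog's webhook template variables:

```hcl
# Sketch: a webhook that monitors can notify to trigger a remediation
# Lambda behind API Gateway. The URL is a placeholder.
resource "datadog_webhook" "remediation" {
  name = "restart-service"
  url  = "https://example.execute-api.us-east-1.amazonaws.com/prod/remediate"

  # Forward alert context to the remediation engine using Datadog's
  # webhook template variables.
  payload = jsonencode({
    alert_id = "$ALERT_ID"
    event    = "$EVENT_TITLE"
    host     = "$HOSTNAME"
  })
}
```

Referencing `@webhook-restart-service` in a monitor's `message` then invokes this webhook whenever the monitor triggers, which is what closes the self-healing loop.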
And then we can, of course, build our compliance and governance
controls on top of this observability-as-code solution.
Finally, whenever we have a scheduled deployment or scheduled downtime,
or downtime related to backend teams that will impact some of our
monitors, we can enable maintenance windows.
Likewise, whenever there is a CI/CD deployment, unless you are using a
very sophisticated deployment strategy, you are calling out a certain
level of outage.
You can create your maintenance windows as observability as code and
then suppress the relevant alerts.
You can integrate that as part of your release CI/CD pipelines, so you
can test it as well, and whenever you do a production deployment, the
pipeline will automatically enable the maintenance window, suppressing
the relevant alerts during that particular deployment window.
Afterwards, you can re-enable alerting and do your testing as well.
So instead of doing this whole process manually, you can leverage the
observability-as-code solutions in AWS to achieve it.
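A sketch of a pipeline-driven maintenance window: the start and end times are variables the CI/CD job passes in, and the tag scoping is illustrative. Newer versions of the Datadog provider also offer a schedule-based successor resource for the same purpose:

```hcl
# Sketch: a deployment maintenance window created from the pipeline.
# Start/end come from variables the CI/CD job sets; tag scoping is
# illustrative.
variable "deploy_start" { type = number }   # epoch seconds, set by the pipeline
variable "deploy_end"   { type = number }

resource "datadog_downtime" "release_window" {
  scope = ["env:production", "service:checkout"]
  start = var.deploy_start
  end   = var.deploy_end

  message = "Scheduled deployment - alerts suppressed for this window."

  # Only mute monitors tagged for this service.
  monitor_tags = ["service:checkout"]
}
```

The pipeline applies this just before the deployment step and lets it expire (or destroys it) once the release is verified.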
So, moving on: those are some of the things we can do as part of
observability as code, which will support our ultimate goal of
developing comprehensive observability components and services when it
comes to platform engineering.
While you're doing this, I think it's very important to be mindful of
the typical issues and ensure you are ready for them.
One key point is that, with whatever we do these days, we have to
ensure we have a way to measure progress, and whether what we have
done is actually allowing the business to see benefits.
So we have to look at your mean time to detection, mean time to
resolution, and mean time between failures, and ensure there is a way
to extract these metrics so we can measure them as we go through our
observability-as-code journey in AWS.
We have to look at whether this is helping us improve reliability and
availability, and how it is helping us enhance the customer experience.
This can be tied into your service level objectives, and can improve
your service level objectives in a periodic fashion.
Those are some of the values we have to bring in as part of these
kinds of solutions.
Also: can you improve your developer velocity?
Can this have a positive impact on your DORA metrics, such as lead time
for change, increased deployment frequency, keeping production stable
while you manage customer experience using service level objectives,
and a reduced deployment failure rate?
You have to think about those things and capture those metrics, and for
some aspects of this you can of course use observability-as-code
solutions as well.
Instead of letting teams come up with their own solutions, you can
standardize this and make it part of the observability-as-code
solution too, so that it can be tracked properly as well.
Finally, before we wrap up: observability as code has a lot of
advantages, but there are some best practices you have to be
mindful of.
Number one, I would suggest managing your API keys securely.
When it comes to the Datadog integration with AWS, you will have your
Datadog API keys and AWS-based credentials, and you will have to store
and access those in a very secure way.
You will definitely have to use version control, because that's one of
the key advantages, one of the key elements, of observability as code;
make sure you don't miss out on it.
Keep your code in Git; those are the best practices.
And you will obviously have to modularize instead of going in a
big-bang way: modularize into your Datadog agent integrations, APM
enablement, real user monitoring enablement, metric enablement,
dashboard creation, monitors, synthetic monitors, log pipelines,
governance, and access controls.
With that modularization, you can take full advantage of it.
This observability-as-code solution has to plug into your deployment
pipelines and go through multiple environments until you move into
production; that way you can realize the true potential of
observability being written as code for AWS.
You will also have to look at role-based access controls and enable
them based on your user community, and you'll have to look at how you
do that as well.
And this has to be a centralized solution: it should provide
centralized outputs, centralized dashboards, a centralized
architecture, the alerting mechanisms, centralized logging systems,
standardization across everything.
That is very important.
So this has to rest on standardization: every engagement, every line of
business, every team in your organization is pretty much following a
defined standard, with quality gates and with the ability to review.
This should come with a lot of documentation and reviews, which will
enable collaboration.
Datadog has a lot of integrations, and sometimes leveraging those
direct integrations will save you time when it comes to AWS, such as
the serverless integration, the container integration, and the other
AWS service integrations as well.
All of this drives collaboration.
So you have to ensure these practices are taken into your multiple
teams, and work with them to enable this collaboration, so that you
can get the value out of enabling observability as code for your
AWS ecosystems.
With that, I would like to close my presentation.
I hope you enjoyed it, and we'll see you in a future presentation.
Thank you very much for taking your time.
There are a great number of presentations by different presenters as
part of Conf42 Platform Engineering 2024; I encourage everyone to look
at those presentations as well.
So, thanks for taking the time.
Take care.
Bye.