Conf42 Platform Engineering 2024 - Online


AWS Observability as Code: Leveraging Datadog for Advanced Platform Engineering


Abstract

Observability is a core challenge in modern platform engineering, complicated by diverse tools and dynamic environments. Leveraging AWS and Datadog as code can resolve these issues, creating a streamlined observability platform that enhances monitoring, reliability, and system management.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi everyone. A recent Catchpoint survey identified that 66 percent of organizations use two to five monitoring or observability tools, and that 24 percent of organizations have breached a contractual service level agreement obligation during the last 12 months. Those are two very important numbers. Having more than two or three observability tools is problematic because you will struggle to establish a unified, single source of truth. And the 24 percent figure on contractual failures is alarming as well: it shows that even with multiple tools, organizations are struggling. My name is Indika Wimalasuriya. Welcome to Conf42 Platform Engineering 2024. My presentation is about AWS observability as code, and I will focus on leveraging Datadog to enable advanced platform engineering. I will give you a quick overview of the challenges modern observability platforms face due to their distributed nature, discuss the role of observability in platform engineering, and touch on Datadog, which I will use to enable observability as code in AWS-based infrastructures. We will look at how to implement observability-as-code solutions with real-world implementation details, and we will wrap up with best practices and pitfalls to avoid. A quick intro about myself: I am based in Colombo, where I live with my wife and daughter. I come from a strong reliability engineering and solution architecture background; my specialties are site reliability engineering, observability, AIOps, and generative AI. I am currently employed at Virtusa, where I oversee technical delivery and capability development. I am a passionate technical trainer and an energetic technical blogger, an AWS Community Builder in Cloud Operations, and an Ambassador at the DevOps Institute. To start: modern distributed systems are evolving fast, and with that the complexity we need to manage keeps increasing. Traditional monoliths have moved to microservices, and from the data centers where we used to host our applications we have moved to the cloud, and in some instances on to serverless architectures. This has expanded the number of data sources, produced surging data volumes, and exponentially increased the failure scenarios. So while it adds a lot of value to modern systems, it brings a higher level of failure risk as well, which makes these environments difficult to manage and elevates the role of observability. Traditional monitoring, a mainly rule-based, predefined way of looking at systems, is no longer enough. We have moved to observability, mainly due to the complexity, dynamic nature, and elasticity of distributed systems. Most of our systems are now container based; we have CI/CD deployments and integrations, we connect to API gateways, and there is increased use of cloud service providers. Some systems are poly-cloud or multi-cloud, leveraging multi-vendor solutions.
Our systems also have scalability challenges, which we sometimes rely on these cloud providers' solutions to rectify, and this requires observability into those scaling triggers. We want to detect issues faster, resolve them faster, and have end-to-end visibility of our user journeys. These are some of the key drivers of the need for observability. To quickly touch on its pillars: logs describe what your code is doing, and we typically use them for root cause understanding. Metrics are numbers, a point-in-time view of your application's performance, and metrics are generally used for alerting, because a number gives us a clean trigger for generating alerts (a minimal Terraform sketch of such a metric-driven alert follows this passage). Tracing shows what your code is doing at a given point: you can see method-level behavior and the time taken, and you can even trace a request from the front end through the middleware to the backend database calls. So tracing is very important; it lets us understand where time is spent in our code and what our code is doing. On top of that, we build alarms, which are our monitors; we develop dashboards; we create synthetic monitors; and we use real user monitoring, which tries to understand what your actual end users are doing. We look at the infrastructure, the network, and the security aspects of our ecosystem, and we look at how to optimize cost, both the AWS or cloud provider infrastructure cost and other costs. These are the key fundamentals you want to achieve with observability. When it comes to platform engineering, let me take one example. Say you have an organization with around three departments or lines of business, and each department has multiple applications or engagements running. What can happen is that these application teams each pick the observability tools they are comfortable with: alerting tools, log management tools, security tooling. Over a period of time we lose unification; it becomes very difficult to build consistency across these systems, and whenever new applications and services come in, we face challenges. So as part of platform engineering, what we are trying to do in the observability area is build a unified observability strategy, making observability consistent across teams and ensuring it is centralized. We want an efficient troubleshooting and incident response framework that leverages cross-team collaboration and drives fast, efficient root cause analysis. We also want scalability and reliability, leveraging proactive monitoring, predictive capabilities, and other advanced features such as capacity planning. And we want to enhance the developer experience, because developers are important and they need end-to-end visibility into our production systems.
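To make the "metrics drive alerting" point concrete, here is a minimal sketch of a metric-based monitor defined with the Datadog Terraform provider. The service name, query, thresholds, and notification handle are hypothetical placeholders, not values from the talk:

```hcl
# Hypothetical metric-driven alert; service, query, thresholds, and the
# notification handle are placeholders.
resource "datadog_monitor" "checkout_latency" {
  name    = "[production] checkout latency is high"
  type    = "metric alert"
  query   = "avg(last_5m):avg:trace.http.request.duration{service:checkout,env:production} > 0.5"
  message = "Average checkout request latency is above 500 ms. @slack-platform-alerts"

  monitor_thresholds {
    warning  = 0.3
    critical = 0.5 # must match the threshold in the query
  }

  tags = ["service:checkout", "env:production", "managed-by:terraform"]
}
```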
This creates the need to build self-service observability solutions, with access and dashboards specific to developers, and to enable services around security and compliance, cost efficiency, and cost tracking. Those are the key aspects of observability in platform engineering, and they are what let us realize the true value of platform engineering in this area. Now let me talk about Datadog, because here we are using Datadog on top of AWS to see what observability and observability-as-code capabilities we can enable. Datadog supports around 750+ integrations as data sources, and it handles logs, traces, metrics, and session data. It provides dashboards, alerts, collaboration features, mobile access, monitoring workflows, Watchdog (its AI engine), and OpenTelemetry integration. Some of the observability use cases Datadog covers are infrastructure monitoring, application performance monitoring, universal service monitoring, log management, continuous integration visibility, continuous profiling, real user monitoring, network monitoring, synthetic monitoring, cloud security, application security, observability pipelines, and many more. All of this allows the key parties, development teams, IT operations, the security team, the support team, and the business, to have a full view of the ecosystem. When you are building an AWS-based, observability-driven platform engineering solution, there are a few things to focus on. One is centralized Datadog integration: for standard workloads such as EC2 we may have to install the agent; for containers we have to do the container integration; and for serverless, such as AWS Lambda, we may have to instrument the functions with the Datadog libraries and enable the AWS integration, especially for APM. With this we enable the key elements of observability: standard metrics and dashboards, logs, which we want to keep consistent, and traces and spans, so that we can troubleshoot and identify what is happening in microservices hosted in AWS, in containers, or in AWS Lambda. On top of that, Datadog lets us build alerting and incident workflows, along with other concerns such as performance and scalability, resource management, cost management, and CI/CD integrations. We also look at automated setup, governance (how we build access controls and enable compliance controls), and developer-related services such as developer access, documentation, and training. Together these allow us to develop a comprehensive observability platform engineering framework for AWS; a sketch of the AWS integration as code follows.
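As an illustration of that centralized integration step, here is a hedged Terraform sketch using the provider's `datadog_integration_aws` resource (newer provider versions offer `datadog_integration_aws_account` as its successor). The AWS account ID, role name, and tags are assumptions:

```hcl
# Hedged sketch of the centralized AWS-Datadog integration. The AWS account
# ID and IAM role name are hypothetical; the role must already exist with
# the permissions Datadog documents for its AWS integration.
resource "datadog_integration_aws" "platform_account" {
  account_id = "123456789012"
  role_name  = "DatadogIntegrationRole"

  # Tag every host ingested from this account, and only monitor resources
  # carrying the filter tag.
  host_tags   = ["managed-by:terraform", "org:platform"]
  filter_tags = ["datadog:monitored"]
}
```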
Moving on to observability as code, there are a few key elements. The first is to treat your observability configuration as code: the agent integrations or the Datadog-AWS integration, and the way you set up metrics, logs, and traces in an AWS-based ecosystem, leveraging Datadog to achieve those things. The next is versioning and management. We want version control, using a tool like Git, so we can reuse components, track our changes, and increase collaboration. Of course, we will have to pair these observability-as-code solutions with CI/CD deployments and automation: while the development teams are building these solutions, they can push the observability-as-code configurations through multiple pre-live environments and keep deploying from there until they reach production. With this we can enable consistency and standardization: implement standardized templates, set up quality gates, build modules for observability configuration, and apply consistent settings across environments. It also enables comprehensive monitoring and maintenance of the observability setup itself: conducting reviews and validations of our configuration and implementing test frameworks, so that we have proper quality gates as we move from pre-live to production. We can standardize documentation too, which enables collaboration across teams. The whole idea is that we do not configure Datadog by hand; we develop our solutions as observability as code, and this code can be promoted environment by environment until we reach production. Different teams can contribute, we can track versions, we can collaborate, and we get better quality. Some of the benefits of an observability-as-code solution on AWS: it improves accuracy and enhances visibility, because you are working in code, and that visibility drives accuracy. Since it is code, you can do a lot of validation and collaboration, and on top of the code you can build quality gates. It is also much faster, giving you faster turnaround times. It can be consistent across environments, which addresses a common challenge: you may have some degree of observability in your production environment that is not identical in your lower environments, and when that happens developers face challenges, misunderstandings, and misinterpretations. And obviously observability as code promotes better collaboration and allows continuous improvement of our systems. Moving on: with AWS there are many infrastructure-as-code tools we could use for this, such as Terraform, Ansible, Pulumi, or AWS CloudFormation. Here we are going to leverage Terraform, which is widely used across the board. What we are trying to achieve is to write the entire observability-as-code solution in Terraform and push it into Datadog: enable the Datadog integration, so the AWS-Datadog integration is completely automated with Terraform, and ship AWS logs into Datadog, whether from EC2 workloads, container-based workloads, or a serverless framework. A sketch of packaging such definitions as a reusable Terraform module, applied identically across environments, follows.
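To make the "standardized templates and modules" idea concrete, here is a minimal sketch of a reusable Terraform module that defines one monitor and is then instantiated per environment. The module layout, names, and query are illustrative assumptions:

```hcl
# modules/service-monitors/variables.tf
variable "env" { type = string }
variable "service" { type = string }

# modules/service-monitors/main.tf -- one definition, reused everywhere
resource "datadog_monitor" "high_error_rate" {
  name    = "[${var.env}] ${var.service} error rate is high"
  type    = "metric alert"
  query   = "sum(last_5m):sum:trace.http.request.errors{env:${var.env},service:${var.service}}.as_count() > 50"
  message = "Error rate breached for ${var.service} in ${var.env}. @slack-platform-alerts"
  tags    = ["env:${var.env}", "service:${var.service}", "managed-by:terraform"]
}

# deployments/main.tf -- the same module promoted across environments
module "checkout_staging" {
  source  = "./modules/service-monitors"
  env     = "staging"
  service = "checkout"
}

module "checkout_production" {
  source  = "./modules/service-monitors"
  env     = "production"
  service = "checkout"
}
```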
We can streamline that using Terraform, and we can leverage observability as code to enable APM in containers, EC2-based workloads, or even serverless frameworks. Then we want to define common metrics and develop common dashboards. This is very important, because developers, DevOps teams, and SREs will refer to these standard dashboards, and we can make them available not just in production but in your pre-live environments as well: development, test, integration, and staging. That way you have unification, and before developers see how their code behaves in production, they can understand how it behaves in the pre-live environments. Then we can automate the creation of synthetic monitors. This is very important: synthetic monitoring mimics what actual end users do in production, which gives us more control over seeing how our application behaves, and of course we can generate alerts on top of these synthetic monitors, metrics, or log events. We can automate account creation and access control as observability as code, enable CI/CD pipeline integrations, and in some instances the distributed tracing setup as well. One of the important aspects of site reliability engineering is defining your service level indicators and service level objectives. Datadog as an observability tool provides very good capabilities for defining service level objectives: you can create a metric-based SLO (good events divided by total events), a monitor-based SLO (for example, built on your synthetic monitors to track uptime), or an SLO based on time slices. All of these options can be automated: we can get our developers to write SLOs as observability as code, and that code can be promoted through multiple pre-production environments until it moves into production. That way we keep the service level objectives consistent from the pre-live environments through to production. We can also enable automated incident management workflows through our observability-as-code solutions, and finally compliance and governance controls as well. Next we will look at some Terraform code snippets, which will give you an idea of these integrations; a minimal SLO-as-code sketch follows.
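Here is a hedged sketch of a metric-based SLO ("good events divided by total events") expressed as code with the `datadog_service_level_objective` resource. The service, metrics, and targets are hypothetical:

```hcl
# Hypothetical metric-based SLO: good events / total events for a checkout
# service, with a 30-day target. Metric names and targets are assumptions.
resource "datadog_service_level_objective" "checkout_availability" {
  name        = "Checkout availability"
  type        = "metric"
  description = "Ratio of successful checkout requests (good events / total events)."

  query {
    numerator   = "sum:trace.http.request.hits{service:checkout} - sum:trace.http.request.errors{service:checkout}"
    denominator = "sum:trace.http.request.hits{service:checkout}"
  }

  thresholds {
    timeframe = "30d"
    target    = 99.9
    warning   = 99.95
  }

  tags = ["service:checkout", "managed-by:terraform"]
}
```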
To take a step back: if I want to develop an observability-as-code solution leveraging Datadog for AWS, step one is to define the observability configuration. We generally identify and create configuration files covering the key observability components, such as metrics, logs and traces, dashboards, and events, most of which I covered on the previous slide. Step two is version control: we keep our observability-as-code configuration in Git, manage versioning, enable collaboration and branching, and follow merging strategies, so that we have comprehensive control over our observability. Step three is testing: unit tests and other forms of validation, such as syntax checks. Step four is continuous integration: integrate with the CI pipelines and automatically run tests, so changes pass through multiple environments and multiple quality gates before moving into production, and as part of CD you can do the automated deployments as well. Step five is monitoring and validation: monitor the automated deployments moving across environments, and involve the teams who own the pre-live environments, developers, testers, end-to-end and integration teams, and even the DevOps and SRE teams, so they can continuously review and validate the outputs of these observability-as-code solutions in the pre-live environments. This also typically supports your compliance needs. Because we are focusing on Terraform, here is a typical Terraform folder structure: an observability-as-code folder containing the deployments, with main.tf, providers.tf for the Datadog provider configuration, variables.tf for your variables, and versions.tf to pin your Terraform version. Alongside these you can have a set of YAML files covering your monitors, service level objectives, dashboards, maintenance windows, and whatever other integrations and automations you want to enable. Typically, to integrate Datadog, we either run the Datadog agent in your workloads or do a direct integration with the serverless framework, and we can use Terraform for this configuration. You will see we use the Datadog API and application keys; rather than hard-coding them, we can store these keys in AWS Secrets Manager and read them from there, so that we have a better mechanism for handling secrets. We can also take logs in AWS, which typically reside in CloudWatch, and ship them to Datadog. While doing that, we can apply regular expressions to split the logs, extract the important ones, and filter logs based on log type, errors, warnings, and so on, so we can identify and extract the different log exceptions. An example of the secrets-aware provider configuration follows.
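For example, a minimal provider configuration that reads the Datadog API and application keys from AWS Secrets Manager rather than hard-coding them might look like this; the secret names and region are hypothetical:

```hcl
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # hypothetical region
}

# Read the Datadog keys from AWS Secrets Manager instead of hard-coding them.
# The secret names below are hypothetical.
data "aws_secretsmanager_secret_version" "datadog_api_key" {
  secret_id = "datadog/api_key"
}

data "aws_secretsmanager_secret_version" "datadog_app_key" {
  secret_id = "datadog/app_key"
}

provider "datadog" {
  api_key = data.aws_secretsmanager_secret_version.datadog_api_key.secret_string
  app_key = data.aws_secretsmanager_secret_version.datadog_app_key.secret_string
}
```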
We do have to be mindful that Datadog is a paid solution, and it can be somewhat expensive when you try to ship entire logs into Datadog. What you can do is apply a lot of filtering, so that you save cost when using Datadog for log management. Moving on, one of the other things I personally like is developing synthetic monitors. We can build synthetic monitors to test different pages or user journeys, and we can configure failure retries; those settings matter. Synthetic monitors try to mimic the end users of our production systems, so they have a lot of value, and again we can use Terraform to create them. Another personal favorite is creating alerts. First come service-level-objective-based alerts; then you create alerts for your caching layer, your middleware or logic layer, your service layer, and your end-to-end journeys. Your alerts should be more or less correlated with your service level objectives, and the service level objectives correlate with the end user experience. You can of course also have triggers for infrastructure monitoring, network monitoring, and anomaly monitoring. An alert can be a standard metric-driven alert, or you can enable AI-assisted alerts such as anomaly detection or forecasting. You can define the thresholds yourself, or, especially for anomaly detection and forecasting, let Datadog set dynamic thresholds based on a baseline. You can configure which teams each alert routes to, and you can apply a lot of tagging. Using observability as code in AWS here gives you many options and a lot of standardization, which adds real value in a large organization managing multiple teams. Most teams use some mix of workloads, EC2, containers, or serverless solutions like Lambda, and what you monitor is often the same: caching metrics, service-layer metrics, backend metrics, and frontend metrics if you enable real user monitoring. So there can be a lot of common metrics, and on top of that we can adopt a lot of common dashboards, which can be catered to multiple end user journeys. Because the technology is the same and what happens in the background is the same, we can create standard dashboards, and using observability as code we can move these dashboards across different AWS environments: pre-live, staging, and production. That gives you consistency and lets your developers look at how the application is behaving in multiple environments, not only in production. This is very valuable, and with it we gain a lot of standardization, so the same capabilities can be offered to other engagements and teams as well. A sketch of a synthetic monitor defined in Terraform follows.
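Here is a hedged sketch of a synthetic API test defined with the `datadog_synthetics_test` resource, including the retry settings mentioned above. The URL, locations, and schedule are illustrative assumptions:

```hcl
# Hypothetical uptime check for a login page, run from two public locations.
resource "datadog_synthetics_test" "login_check" {
  name      = "Login page availability"
  type      = "api"
  subtype   = "http"
  status    = "live"
  message   = "Login page failing from multiple locations. @slack-platform-alerts"
  locations = ["aws:us-east-1", "aws:eu-west-1"]
  tags      = ["service:auth", "managed-by:terraform"]

  request_definition {
    method = "GET"
    url    = "https://example.com/login" # hypothetical URL
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 300 # run every 5 minutes

    retry {
      count    = 2
      interval = 300 # milliseconds between retries
    }
  }
}
```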
So whenever new applications or engagements are onboarded, things become straightforward: they simply plug into Git, look at these Terraform scripts, the observability-as-code solutions, and integrate them into their AWS environment. Moving on, another piece is enabling APM, application performance monitoring. Again, whether it is workload based, using OpenTelemetry, or serverless based, integrating APM with your serverless frameworks, we can leverage observability-as-code solutions in AWS. Some typical platform engineering use cases are account creation and access control, and we can apply observability as code here too: different access layers for your dashboards, monitors, logs, and synthetics, admin users, read/write privileges, account creation, cost views, and so on. We can leverage Terraform, meaning our observability-as-code solutions with AWS and Datadog, to achieve these capabilities. And of course we can use this to streamline our CI/CD pipeline integrations with Terraform, which is very beneficial because we want this observability-as-code solution to pass through multiple environments so we realize the benefit. If a specific customization is needed for your distributed tracing setup, we can leverage this as well. As I said, one of the other important things is defining service level objectives, and we have seen a common framework, common aspects, in how they are defined, because a typical organization uses AWS workloads, containers, and serverless frameworks, which lets us leverage standardization and common factors. With SLO as code, instead of defining SLOs last, once the application goes live, we want our developer community to think about service level objectives while designing, developing, testing, and deploying the code through multiple environments. They can write the SLO as part of the code and ship it through multiple environments until production. That way we can monitor our service level objectives early: in a pre-live environment we can do performance testing, soak testing, security testing, and other forms of stress testing; we can mimic high peak traffic, high concurrency, spikes or floods of traffic, and different types of user behavior, and see how our SLOs behave. This is a very powerful approach. Typically, people define SLOs in Datadog at a late stage, for production only; this way we have them from the very beginning, in multiple environments, which gives a lot of visibility and consistency. And Datadog provides webhooks: whenever events are created, we can invoke webhooks, and from there invoke remediation engines running in our AWS workloads or serverless frameworks like Lambda. With that we can build a lot of self-healing and automation solutions; a hedged sketch of wiring such a webhook to a monitor follows.
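As a sketch of that webhook-driven self-healing wiring, here is a `datadog_webhook` resource pointing at a hypothetical API Gateway endpoint that fronts a remediation Lambda, referenced from a monitor via the `@webhook-<name>` handle. The endpoint, metric, and names are all assumptions:

```hcl
# Hypothetical remediation hook: Datadog calls an API Gateway endpoint that
# triggers a Lambda remediation function in our AWS account.
resource "datadog_webhook" "restart_checkout" {
  name      = "restart-checkout"
  url       = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/remediate" # hypothetical
  encode_as = "json"

  # $ALERT_ID and $EVENT_TITLE are Datadog webhook template variables,
  # substituted when the webhook fires.
  payload = jsonencode({
    alert_id = "$ALERT_ID"
    title    = "$EVENT_TITLE"
    service  = "checkout"
    action   = "restart"
  })
}

# Referencing the webhook from a monitor so a triggered alert invokes it.
resource "datadog_monitor" "checkout_unhealthy" {
  name    = "Checkout unhealthy - auto-remediate"
  type    = "metric alert"
  query   = "avg(last_5m):avg:aws.ecs.service.running{service:checkout} < 1" # hypothetical metric scope
  message = "Checkout has no running tasks. Invoking remediation. @webhook-restart-checkout"
  tags    = ["service:checkout", "managed-by:terraform"]
}
```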
Observability as code provides those capabilities as well, and we can of course build our compliance and governance controls on top of it. Finally, whenever we have a scheduled deployment or scheduled downtime, or downtime from backend teams that will impact some of our monitors, we can enable maintenance windows. Whenever there is a CI/CD deployment, unless you are using a very sophisticated deployment strategy, you are accepting a certain level of outage. You can create your maintenance windows as observability as code and suppress the relevant alerts, and you can integrate that into your release CI/CD pipelines and test it, so that whenever you do a production deployment, the pipeline automatically enables the maintenance window, suppressing the relevant alerts during that particular deployment window, and afterwards disables it and runs the tests. Instead of doing all of this manually, you can leverage the observability-as-code solutions in AWS to achieve it; a sketch follows.
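Here is a minimal sketch of such a pipeline-managed maintenance window, using the classic `datadog_downtime` resource (newer provider versions also offer `datadog_downtime_schedule`). The scope, tags, and timestamps are hypothetical and would normally be injected by the pipeline rather than hard-coded:

```hcl
# Hypothetical deployment maintenance window; the pipeline would compute
# and pass in the epoch timestamps as variables.
resource "datadog_downtime" "checkout_deploy_window" {
  scope = ["env:production", "service:checkout"]

  start = 1735732800 # deployment start (epoch seconds, hypothetical)
  end   = 1735736400 # deployment end, one hour later

  monitor_tags = ["service:checkout"]
  message      = "Planned checkout deployment - alerts suppressed by pipeline."
}
```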
So those are some of the things we can do as part of observability as code, supporting our ultimate goal of building comprehensive observability components and services for platform engineering. While doing this, it is very important to be mindful of typical issues and make sure you are ready for them. One key issue: whatever we do these days, we have to ensure there is a way to measure progress and whether what we have done actually lets the business see benefits. Look at your mean time to detection, mean time to resolution, and mean time between failures, and make sure there is a way to extract these metrics so we can measure them as we go through our observability-as-code journey in AWS. Look at whether this is helping us improve reliability and availability and enhance the customer experience; this can be tied to your service level objectives and improve them in a periodic fashion. Also ask: can you improve your developer velocity? Can this have a positive impact on your DORA metrics, reducing lead time for changes, increasing deployment frequency while you protect customer experience with service level objectives, and reducing the deployment failure rate? Think about capturing those metrics, and for some aspects you can use observability as code here too: instead of letting every team come up with its own solution, make this standard, also as observability as code, so it can be tracked properly. And finally, before we wrap up: observability as code has a lot of advantages, but there are best practices you have to be mindful of. Number one, I would suggest, is managing your API keys securely. For the Datadog integration with AWS, you will have Datadog API keys and AWS credentials, and you will have to store and access those in a very secure way. You will definitely have to use version control, because that is one of the key advantages observability as code brings; make sure you do not miss out, and keep your code in Git. You will obviously have to modularize instead of going in a big bang way: modularize into agent and Datadog integrations, APM enablement, real user monitoring enablement, metric enablement, dashboard creation, monitors, synthetic monitors, log pipelines, governance, and access controls, so you can take advantage of that modularity. This observability-as-code solution has to plug into your deployment pipelines and go through multiple environments on the way to production; that is how you realize the true potential of observability written as code for AWS. You will have to enable role-based access controls based on your user community, and look at how you do that as well. And this has to be a centralized solution: it should provide centralized outputs, centralized dashboards, a centralized architecture, centralized alerting mechanisms, centralized logging, and standardization across everything. That is very important. Every engagement, every line of business, every team in your organization should follow a defined standard, with quality gates and the ability to review. It should come with plenty of documentation and reviews, which enables collaboration. And Datadog has a lot of integrations; sometimes leveraging those direct integrations will save you time with AWS, such as the serverless integration, container integration, and other AWS service integrations. All of this drives collaboration, so make sure these practices are taken to your teams, and work with them to enable that collaboration, so you get the value out of enabling observability as code for your AWS ecosystems. With that, I would like to close my presentation. I hope you enjoyed it, and I will see you in a future presentation. Thank you very much for taking the time. There are a great number of presentations by different presenters at Conf42 Platform Engineering 2024, and I encourage everyone to look at those as well. Thanks for taking the time. Take care. Bye.

Indika Wimalasuriya

Associate Director, Senior Systems Engineering Manager @ Virtusa

Indika Wimalasuriya's LinkedIn account


