Conf42 Observability 2024 - Online

AWS: Your Ally Against Observability Anti-Patterns

Abstract

Unlock the power of AWS in conquering observability challenges. Learn how tools like CloudWatch and X-Ray offer structured solutions for efficient log, metric, and trace management, aligned with business objectives for enhanced system reliability.

Summary

  • More than 64% of organizations believe they should monitor their endpoints effectively. 44% of organizations believe learning from failures is important. But close to one fourth of them still have contractual breaches happening at regular intervals.
  • Indika Wimalasuriya: I'm going to walk you through why I truly believe AWS is your ally, or best friend, when it comes to implementing observability. As part of my presentation I will discuss these anti-patterns and how leveraging AWS will help you in your journey.
  • Indika Wimalasuriya has 18 years of industry experience. His expertise is predominantly in site reliability engineering, observability, AIOps, DevOps and generative AI. He currently works at Virtusa.
  • Microservices, observability and cloud have resulted in tons of data and introduced multiple failure points. Observability is very important in this complex distributed system architecture. Why? Because we want to be on top of performance.
  • In distributed systems, cloud and microservices, observability is key. Observability is a friend in enabling reliable systems. It's built on top of infrastructure monitoring, network monitoring, security and cost optimization.
  • One of the challenges is that sometimes you have too many logs. This is where AWS has done a lot. CloudWatch integrates with logging, and you can use its capabilities to overcome some of these anti-patterns.
  • While metrics are good, there are a lot of anti-patterns as well. Sometimes we come up with a lot of unclear, misaligned metrics. By using CloudWatch, especially CloudWatch metrics, you can very easily focus on the metrics that are available.
  • By leveraging AWS X-Ray or using OpenTelemetry smartly, you can bring your traces to the front. And finally, looking at the end-to-end big picture, there are some wider observability problems. AWS services will help you overcome some of these challenges.
  • One big anti-pattern I have seen in observability implementations is not having a plan. Having a plan means having an observability maturity model, so you can take your observability from reactive to proactive, and from predictive to autonomous.
  • With this I thank you for taking the time to listen. I hope you found this session useful. If you have any questions, send me a note on LinkedIn. There are a lot of nice, thought-provoking presentations happening as part of Observability 2024.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
A recent Catchpoint SRE survey identified that more than 64% of organizations believe they should effectively monitor their endpoints, even if they lie outside their physical control. 44% of the organizations believe learning from failures is important and that they should invest in identifying more ways of preventing these kinds of disruptions. Interestingly, 24% of organizations had a contractual breach during the last twelve months, and the majority, around 66% of these organizations, are already using two to five monitoring or observability tools. What this literally means is that monitoring or observing key endpoints is close to every organization's hearts, minds and operations. They understand that failures are normal, they are trying to learn from them, and they are trying to figure out how they can prevent these incidents. But unfortunately, close to one fourth of them still have contractual breaches happening at regular intervals, and this is on top of bringing in multiple observability tools to help their cause. So something is not working, and I'm pretty sure you got the numbers here. Hi everyone, my name is Indika Wimalasuriya, and welcome to Observability 2024. As part of Observability 2024, I'm going to walk you through why I truly believe AWS is your ally, or best friend, when it comes to implementing observability. Observability, as you know, is about understanding internal system behaviors so that we can proactively eliminate some of these issues, or at least identify some of these disruptions in advance, and so improve our mean time to detection and mean time to resolution. While implementing observability, there are widely documented anti-patterns you will come across, and there are some unknown, hidden anti-patterns as well.
As part of my presentation I will discuss these anti-patterns and how leveraging AWS will help you in your journey. I will discuss the importance of observability, why observability is important for organizations. There are no two ways about why it is important. We will understand some of the observability anti-patterns, and we will also look at the capabilities and services AWS offers. We will deep dive and try to understand some of these services and how we can leverage them in our fight to ensure that we don't succumb to anti-patterns. We will also discuss some of the implementation guidelines and best practices. Moving on, a quick intro about myself. My name is Indika Wimalasuriya and I am based out of Colombo, Sri Lanka. I live with my daughter and wife. I have close to 18 years of experience working in the industry. My expertise is predominantly in site reliability engineering, observability, AIOps, DevOps and generative AI. I'm currently working at Virtusa, where I oversee technical delivery, solutioning, architecture and capability development. One significant part of my current job, and something I also enjoy outside it, is being a trainer. I'm very passionate about sharing knowledge, empowering others and building the community. I'm a very passionate technical blogger as well. You can find me at dev.to. I'm also an AWS Community Builder under cloud operations and an ambassador at the DevOps Institute. So as I said, AWS and observability are very close to my heart. I have been involved in implementing a lot of observability solutions for Fortune 500 companies, and over that period of time I have come to understand these anti-patterns, some of the best practices, and how we can leverage AWS to overcome them and expedite our journey. So let's discuss the complexities we have in the current world.
Current systems, as you already know, have moved away from the monolith, and now we have microservices. Microservices obviously bring in a lot of complexity, which has resulted in us moving on from monitoring, which is about checking something that is already predefined. In this day and age our systems have a lot of unknowns, we have to figure out these unknowns and have better control, and observability provides a great solution. Observability is about trying to understand the internal state, and that results in an enormous amount of data. And as you might already know, almost all our systems have moved out from on-premises. We are now in the cloud. So microservices, observability and cloud have resulted in tons of data and introduced multiple failure points. There is so much complexity in our systems that, as a result, they are a little difficult to manage. The unknowns stay hidden, then appear at the most unexpected times and cause disruptions, which lead to bad user experiences, an impact on our revenue, and a lot of other things. So observability is very important in this complex distributed system architecture. Why? Because we want to be on top of performance. Distributed systems are very good, but monitoring them, observing them, identifying their bottlenecks and improving their performance is key. Just because you have a distributed system does not necessarily mean you will get optimal performance. And as we discussed earlier, because distributed systems are complex, there are a lot of hidden unknowns. These unknowns can appear at any given time and bite you. Detecting these issues quickly, identifying them and resolving them is very important. More than that, what organizations try to do is eliminate them, if at all possible, because there's nothing better than fixing something before it impacts you. All of this means we want to have more reliable systems.
Reliable systems mean reliable services and, at the end of the day, happy customers. We also have to understand that, just like anything else, these systems will crash, they will have bugs, and they will go through cycles of bad times. When this happens, we want a comprehensive framework or mechanism that allows us to debug and troubleshoot these things and fix them promptly. So overall, observability is important for distributed systems to make sure that our systems are doing what they are supposed to be doing. It gives us the ability to understand and manage these systems, and to eliminate some of the disruptions by identifying them early. Looking at all the observability pillars, your metrics, logs, traces and events, gives you the proactiveness you always want, so that you can be on top of your game in managing distributed systems. That is why it's important from a distributed systems angle. Now, moving to the cloud: even though it has fixed a lot of our problems, people have started to understand that the cloud is not the single solution for all your problems, especially availability and reliability. Cloud providers give you a platform where you can deploy your systems, services and data, and you are accountable for making sure that you manage it properly. So when it comes to the cloud, unless you are using managed services or some of the cloud scalability solutions, scalability is still your problem, and security is obviously one of your problems as well. To manage risk, we look at poly-cloud, deploying things in multiple clouds, and poly-cloud or multi-cloud brings in different complexities and a whole new dimension to this problem. And we also want to ensure that we reduce cost.
So when you want to achieve all of these things, observability is paramount. Again, for the cloud, observability is important, and it gives us the ability to develop and maintain optimal systems in the cloud. And finally, microservices. With almost all complex systems, we are moving from monolith to microservices. Microservices bring a lot of good and a lot of capabilities that we are harnessing to provide better customer experiences. But what you have to understand is that microservices themselves bring in a lot of complexity, such as dependencies and troubleshooting of performance bottlenecks. These complexities increase the time taken to isolate root causes, and they bring scalability problems and debugging-related problems. All of this means observability is key here. So, as I've been saying for the last few minutes, whether in distributed systems, cloud or microservices, observability is key. Observability is a friend in enabling reliable systems. If you look at observability, Gartner published the observability hype cycle. It shows what triggered observability, what was at the peak, what things we were interested in, how the hype cycle went, and what the takeaways are. As you can see, APM, which is application performance management, is something the industry has been widely leveraging. It's about traces. Then we have the logs, which are predominantly the traditional part of observability. And then we have the metrics, which we very happily use to configure our alerts, because metrics are numbers, and numbers are good at measuring things. Traces help us with troubleshooting, allowing us to dig in and understand exactly what's happening. Over the years we found that monitoring sometimes focuses on infrastructure. But infrastructure is one part of the big picture.
It's actually the code that does the bulk of the work. We have made a lot of good improvements to infrastructure, so now the focus is back on the code, and APM plays a major role here. So observability is all about the logs, which create an audit trail. It's about metrics, which give you the ability to configure alerts and measure things. It's about traces, which let you dig into your code and understand the bottlenecks where the code is not performing. Metrics help you develop alarms, and all of these things allow you to create dashboards. We have synthetic monitors. Over the years we have developed other capabilities like real user monitoring, which is about monitoring and observing what our end users are doing at the front end. And of course this is all built on top of our infrastructure monitoring, network monitoring, security and cost optimization. So in a nutshell, observability is looking at your entire system holistically and trying to understand things before they go wrong. Now that you have an understanding of observability, why it is important and what you are trying to achieve, let's look at AWS. Over the years AWS has been bringing in a lot of capabilities in the observability area. It started with CloudWatch. CloudWatch is integrated with almost all AWS services, so you can ship all your logs there and integrate all the metrics, and from there you can create your dashboards and alerts. Then AWS introduced things like AWS X-Ray, which enables traces. You also have the option of going with OpenTelemetry. AWS introduced real user monitoring to monitor the front end. And more recently they introduced things like Application Signals. With all of these things, AWS has a very powerful collection of services.
Either you can go with AWS native services or, if you are more of an open source kind of person, you can use Grafana on AWS, and on top of that use things like OpenSearch or managed Prometheus, and Jaeger, Zipkin and others to enable your traces as well. So AWS is able to provide these capabilities for both worlds, whether you are an AWS native person or an open source person. All of these capabilities will enable you to develop a great observability platform for your systems. The idea is to understand how you can use them and get that benefit. While you are doing that, understanding some of these anti-patterns is very important, because by knowing them you know how best to use these services, which in some instances will automatically help you and ensure that you don't fall into these anti-patterns. So moving on, let's discuss some of these anti-patterns. I'll go through them by the pillars of observability, starting with logs. One of the challenges we have is that sometimes you have too many logs. It's a good problem to have. But when you have excessive logging, and when it is a little unstructured, what it does is generate a lot of noise, and it becomes harder to extract the details. This is where AWS has done a lot. With CloudWatch logging integration you can not only ship your logs to CloudWatch, AWS has also come up with a lot of new capabilities like log anomaly detection and natural language search. These capabilities help you even in situations where you have excessive logs with no structure and it's a little hard to troubleshoot. You are able to use these CloudWatch capabilities to overcome some of these anti-patterns. And then when you look at metrics, while metrics are good, there are a lot of anti-patterns as well.
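To make the logging anti-pattern described above concrete, here is a minimal sketch of structured logging using only the Python standard library. It is not from the talk: the logger name `checkout` and the context fields `service`, `request_id` and `order_id` are hypothetical, and the output goes to an in-memory buffer rather than to CloudWatch. The idea is simply that one JSON object per line, emitted at a deliberate log level, is easier for a log backend to filter and query than free-form text.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical context fields that callers may attach via `extra=`.
    CONTEXT_FIELDS = ("service", "request_id", "order_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes structured lines to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    # INFO level drops DEBUG noise, one simple guard against excessive logging.
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment declined", extra={"service": "checkout", "request_id": "req-123"})
log.debug("noisy detail that is filtered out at INFO level")

lines = buffer.getvalue().splitlines()
print(lines[0])
```

Only the INFO record is written. In a real deployment the stream would be whatever your runtime forwards to your log backend, and the choice of fields would follow your own logging standard.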
Sometimes we come up with a lot of unclear, misaligned metrics, which finally end up in our service level objectives. Sometimes we use bad sampling when collecting the metrics. And sometimes there are so many metrics that it's very difficult to understand how to pick the right metric for your needs. What this does is bring you a false sense of comfort: you have a lot of metrics, but they might not correlate with the end user experience. Bad sampling might mean you don't have the data when you need it, and having numerous unnecessary metrics might lead to unnecessary complexity. By using CloudWatch, especially CloudWatch metrics, you can very easily focus on the metrics that are available and quickly understand what makes sense, instead of spending a lot of time enabling metrics and only later figuring out what is required and what does not add value. Usually, when you plug CloudWatch into your services, the metrics start appearing, and you can very quickly go through them and understand what each metric is doing and which ones are more relevant to you. You can also do some customizations when it comes to the data. This helps you ensure that you pick the right things, the things that add value for you. You can also use some of the AWS default metrics, which will help in situations where cardinality is a problem. So in a nutshell, CloudWatch metrics are a beautiful thing that will make the metrics part of your observability journey smooth, and they may already help ensure that you don't fall into some of these anti-patterns. So when it comes to traces, traces also have quite a few anti-patterns.
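One way to picture the cardinality point above is a custom metric record in which low-cardinality values are dimensions and high-cardinality values (such as a request ID) are plain properties. The sketch below builds such a record in the shape of the CloudWatch embedded metric format as I understand it (an `_aws` block with `Timestamp` and `CloudWatchMetrics`); the talk does not mention this format, and the namespace, dimension names, metric names and the helper `emf_record` are all assumptions for illustration. It only constructs a dictionary and does not call AWS.

```python
import json
import time


def emf_record(namespace, dimensions, metrics, properties=None, timestamp_ms=None):
    """Build one metric record as a dict.

    dimensions: {name: value}, keep these low-cardinality.
    metrics:    {name: (value, unit)}.
    properties: extra searchable fields that are NOT dimensions
                (for example a request ID), so they do not multiply
                the number of metric series.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_value, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension values, metric values and properties live at the top level.
    record.update(dimensions)
    record.update({name: value for name, (value, _unit) in metrics.items()})
    record.update(properties or {})
    return record


record = emf_record(
    namespace="Checkout/Service",  # hypothetical namespace
    dimensions={"Service": "checkout", "Operation": "PlaceOrder"},
    metrics={"Latency": (182.0, "Milliseconds"), "Errors": (0, "Count")},
    properties={"requestId": "req-123"},  # high-cardinality, so not a dimension
    timestamp_ms=1_700_000_000_000,
)
print(json.dumps(record, indent=2))
```

The design choice worth noting is the split between `dimensions` and `properties`: each distinct combination of dimension values becomes its own metric series, so per-request identifiers are kept out of the dimension list.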
Sometimes we don't give traces priority, sometimes there is a lack of trace ID consistency, and sometimes the instrumentation is not enough. By leveraging AWS X-Ray or using OpenTelemetry smartly, you can bring your traces to the front. The beauty is that you can enable traces using AWS X-Ray really quickly, not only for your microservices but even for Lambda functions and other things. That means you have more traces, you have consistency, and you get the power of distributed tracing. AWS does this seamlessly, without you having to do a lot of hard work. When it comes to traces, there are a few more things, like carrying the context forward. Context is very important when you are trying to connect the front end to the back end, looking at the traces and visualizations, and connecting them with real user monitoring. The X-Ray capabilities allow you to significantly reduce the manual effort of fixing some of these problems. In some instances X-Ray does it almost magically: if you are using AWS Lambda, enabling X-Ray is probably one click. So X-Ray as a service is powerful, and it allows you to mitigate some of the challenges you will come across when you want to enable traces. And finally, when you look at the end-to-end big picture, there are some wider observability problems. We have alert overloading. This causes alert fatigue, where you have a flood of alerts and you are not able to understand or isolate what has caused them. AWS provides a lot of services like CloudWatch, SNS and so on, which you can configure smartly to get through some of these alert overloading situations. And in a lot of places you will have seen that observability is siloed.
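To illustrate what trace ID consistency and carrying the context forward mean in practice, here is a small standard-library sketch that parses and re-emits a trace header in the `Root=...;Parent=...;Sampled=...` form used by the X-Ray `X-Amzn-Trace-Id` header. The X-Ray SDK or an OpenTelemetry propagator normally does this for you; the helper names below are hypothetical, and the code is only meant to show the principle that every hop keeps the same root trace ID and substitutes its own segment ID as the parent.

```python
import os
import time

TRACE_HEADER = "X-Amzn-Trace-Id"


def new_trace_id(now=None):
    """Trace ID shape: version 1, 8 hex digits of epoch seconds, 24 random hex digits."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def new_segment_id():
    """Segment ID shape: 16 random hex digits."""
    return os.urandom(8).hex()


def parse_header(value):
    """Split 'Root=...;Parent=...;Sampled=...' into a dict."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


def outgoing_header(incoming, segment_id):
    """Build the header for a downstream call.

    The root trace ID is reused when present (consistency across services);
    the caller's own segment ID becomes the new parent.
    """
    fields = parse_header(incoming) if incoming else {}
    root = fields.get("Root") or new_trace_id()
    sampled = fields.get("Sampled", "1")
    return f"Root={root};Parent={segment_id};Sampled={sampled}"


incoming = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
my_segment = new_segment_id()
downstream = outgoing_header(incoming, my_segment)
print({TRACE_HEADER: downstream})
```

If a service drops the incoming header and starts a fresh root ID instead, the front-end and back-end halves of a request show up as unrelated traces, which is the inconsistency described above.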
But with AWS X-Ray, CloudWatch and other services, you are able to bring a unified view into your observability framework, where you can see things from 30,000 feet above when required and then drill down whenever you need. One other thing is that people sometimes ignore the non-functional requirements when it comes to observability, which in my mind are key. AWS services allow you to get some of these NFR metrics and other data. For example, you might be using AWS Lambda or RDS managed services. All those things will help you overcome some of these challenges as well. They might not necessarily be related to observability, but going with these services will definitely help you achieve your observability objectives. One other anti-pattern is that most of our applications are not isolated. We have upstreams, we have downstreams, we have a lot of dependencies. That creates a lot of blind spots in our infrastructure, and managing these things is also a little tedious. But you can use AWS Systems Manager when you are doing a lot of updates, patching and other things. You can use AWS CloudFormation to bring in consistency through infrastructure-as-code solutions, which will also enable you to build some of this observability into your system in an automated fashion. And obviously there are some environmental inconsistencies that you can address by using AWS services like Elastic Beanstalk or CodePipeline. So far we have gone through the observability pillars, the metrics, logs and traces, and also the big picture of observability. We have looked at the anti-patterns and at how AWS services like CloudWatch, CloudWatch metrics, X-Ray, dashboards and others ensure that, just by the nature of using these services, you are able to mitigate some of these anti-patterns.
One big anti-pattern I have seen in observability implementations is not having a plan. People will probably jump in and try to get it done. But what's important is having a plan, and having a plan means having an observability maturity model where you understand what your end game is. What you want is a plan where you can take your observability from reactive to proactive, then from proactive to predictive, and then from predictive to building your system with autonomous capabilities. I am not suggesting that you take this and stick to it as is, but I am suggesting that you create a maturity model which suits you, because observability is not just a destination, it's a journey. That way you are able to take your observability from reactive to proactive and from predictive to autonomous. So when you are working with your logs, metrics and traces, it's not just about enabling logs, it's about getting value out of logs. It's about bringing AI into logs so that you can cut down some of the manual touch points. It's about making your system autonomous. When it comes to infrastructure, networking and security, again the same thing applies. It's about pushing things from reactive to proactive, then to predictive, and then to autonomous. One of the other important things in observability is understanding how to measure your progress. You can measure your progress by looking at things like mean time to detection and mean time to resolution. That means asking: are you detecting things quickly, are you able to provide solutions quickly, what is the interval between your failures, are you improving your system reliability, how are your customers feeling about your systems, and are you able to increase your developer velocity while achieving your service level objectives?
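The progress measures mentioned above can be computed from incident records in a few lines. The following is a sketch under stated assumptions, not a standard: the `Incident` fields and sample timestamps are invented, detection time is measured from when the failure started, resolution time is measured from detection, and the interval between failures is taken from one incident's resolution to the next incident's start. Teams define these differently, so adjust the definitions to your own incident process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the failure actually began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def mean_time_to_detect(incidents):
    """Average minutes from failure start to detection."""
    return mean(_minutes(i.detected - i.started) for i in incidents)


def mean_time_to_resolve(incidents):
    """Average minutes from detection to resolution (one common convention)."""
    return mean(_minutes(i.resolved - i.detected) for i in incidents)


def mean_time_between_failures(incidents):
    """Average minutes from one incident's resolution to the next one's start."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [
        _minutes(later.started - earlier.resolved)
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return mean(gaps)


t0 = datetime(2024, 1, 1, 0, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident(
        t0 + timedelta(hours=10),
        t0 + timedelta(hours=10, minutes=15),
        t0 + timedelta(hours=11, minutes=15),
    ),
]

print("MTTD (min):", mean_time_to_detect(incidents))
print("MTTR (min):", mean_time_to_resolve(incidents))
print("MTBF (min):", mean_time_between_failures(incidents))
```

Tracking these numbers per quarter gives a simple signal of whether a move from reactive toward proactive is actually happening: detection and resolution times should fall while the interval between failures grows.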
This will help you understand whether you are aligned with business goals, because at the end of the day this is all about business. You have to add value to your business. Unless you are adding value to your business, your business partners will not see the value of observability. And you can turn the anti-patterns around into best practices as well, by doing the opposite: ensuring there is a standard for logging, a better way of managing logging, and a better way of using traces. You focus on instrumentation, automated responses and a continuous drive to achieve performance optimizations, and on things like moving to some of the AWS managed services. These are very good because they are managed, they are all integrated with CloudWatch, and you are able to use SNS and the other alerting capabilities and the dashboards. So there's a unified way of doing that. That's the whole point when it comes to AWS. AWS has bundled these things so that there is one place to go, and the unification and simplification will provide you far more value and help you in your journey from reactive to proactive and then to autonomous. With this I thank you for taking the time to listen. I hope you found this session useful. You can find me on LinkedIn. If you have any questions, send me a note on LinkedIn. There are a lot of nice, thought-provoking presentations happening as part of Observability 2024. I encourage you to go and listen to the others, and I'm sure everything will help you in your observability journey. Thank you very much for taking the time. It was my pleasure presenting.
...

Indika Wimalasuriya

Associate Director, Senior Systems Engineering Manager @ Virtusa

Indika Wimalasuriya's LinkedIn account


