Conf42 Incident Management 2024 - Online

- premiere 5PM GMT

Revolutionizing Incident Management with Cloud Technologies and Java

Abstract

Discover how cloud technologies and Java can revolutionize your incident management! Learn how microservices, serverless computing, and CI/CD integration reduce downtime, speed up incident response, and boost scalability by 78%. Unlock the future of resilient, scalable incident handling!

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hello everyone. My name is Arvind Kumar Akula. I work as a staff software engineer at Guaranteed Rate. Today my talk is about revolutionizing incident management with cloud computing and Java technologies. We will start with an introduction to cloud computing and Java technologies, then discuss the challenges with traditional incident management and see how the cloud addresses them. We'll look at how the synergy between Java and cloud services, along with microservices, serverless computing, and CI/CD integration, helps in incident management. We'll also discuss real-time monitoring tools, how we can automate incident responses, and how we can collaborate and communicate better with teams when we have to handle an incident. Finally, we'll discuss some data-driven insights and a case study done by IBM for an e-commerce application.

Introduction to cloud computing and Java technologies. We all know cloud computing has become the cornerstone of digital transformation. The reason is that cloud computing offers major features like scalability, cost efficiency, and agility. The global cloud computing market is projected to reach around $832.1 billion by 2025. So we can assume that cloud computing has become one of the modern business strategies. Most enterprise applications, like Netflix, Amazon, eBay, and a lot of other companies, use cloud computing. Along with that, Java is one of the main application programming languages, which is very reliable and has long been used in enterprise environments. Java's main philosophy was platform independence, which is write once and run anywhere. This fits perfectly with the distributed nature of cloud computing.
This helps developers write portable applications that run seamlessly across multiple cloud platforms. The combination of platform independence and the distributed nature of cloud computing is a powerful synergy that enables businesses to develop robust, scalable, and efficient applications. Java also has an extensive ecosystem with many frameworks. The main ones are Spring Boot and Micronaut, with which we can develop a production-ready application within a few days. This makes Java an ideal technology for driving digital transformation. That's about it. Let's go ahead.

Here we are going to discuss the challenges of traditional incident management. Earlier, before cloud computing, microservices, and Java, we were struggling with siloed systems and slow response times. Based on a study from Gartner, downtime costs around $5,600 per minute. So in a holiday season, when traffic peaks, any outage costs us those customers; we are going to miss that traffic. That's the loss Gartner is describing. Manual processes also lead to human errors. There are a lot of things we do manually: we search logs, copy request IDs, update spreadsheets, and contact team members individually. There is a high probability of manual errors, which is why traditional incident management is not up to the mark. Siloed systems hinder collaboration and visibility, because the monitoring tools were not connected. We used to go directly to Tomcat and other application servers and check the errors there, and we had no integration between this logging mechanism and our communication channels.
So that was the issue. Legacy infrastructure also lacks the scalability to handle modern demands. Take the infrastructure we used to have earlier. Let's say we have 10 nodes to serve the traffic. Whenever there is a holiday season, we need to scale up, so we need to plan ahead and have a team build the servers, deploy the application, and make it ready for that peak time. All of this can be avoided with cloud computing. We'll see that in the next slide.

Here we are going to discuss how the cloud gives us an advantage. These are the main things, which we touched on just now. The solution the cloud provides is agility, scalability, and resilience. On-demand resources can scale up and scale down automatically in the cloud. We can accelerate incident responses automatically, and whenever a system or a database is down, we can add self-healing in the cloud. Centralized platforms enhance collaboration through integrated monitoring, logging, and communication tools. We have a lot of examples for this, which we'll discuss in a minute, like Kibana, Grafana, and a lot of other integration tools, where we can integrate all the K8s clusters and the local instances and get all the logs in one place. The other important feature is reduced infrastructure cost. Whenever there is traffic, we scale up the number of resources, and whenever there isn't, we free up those resources so that we can decrease the maintenance cost.

Here we are going to discuss how Java is the powerhouse of cloud-native implementation for incident management.
Platform independence is the main feature: we can write once and run anywhere, which helps with seamless deployments across multiple cloud platforms. We have a framework like Spring Boot where we can build robust microservices, and we have Spring Cloud for service discovery. We can do configuration management and load balancing using this framework. We also have an extensive set of libraries for monitoring, logging, and alerting, like Log4j, SLF4J, and Micrometer, which Spring Boot integrates. These libraries help us log application errors or info messages. And we have a large, active community supporting innovation, so new things can be integrated into Java. We'll discuss more about this now.

This is about microservices. Earlier, we were discussing the siloed environment, which was a single monolithic application that can be divided into multiple smaller pieces of logic. Let's take an example: we have a framework with a module called user authentication, an incident logging module, and a notification dispatcher. In a monolithic application architecture, whenever there is an issue with user authentication, it is going to impact incident logging and notification dispatching. Similarly, whenever we have an issue with incident logging, it is going to impact user authentication, because all of them are running in one single instance. This can be avoided using microservices. In microservices, we can isolate each and every module. User authentication, incident logging, and everything else are separate services, so whenever there is an issue with incident logging, it is not going to impact the user authentication microservice.
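
The isolation described above can be sketched in a few lines of Java. This is only a toy, in-process model and not how real microservices are deployed: each registered handler stands in for a separately running service, and the class names (`ServiceRegistry`, `IsolationDemo`) and the service names are assumptions made up for illustration. The point it shows is that a failure in incident logging is contained to that one service.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy in-process model: each entry stands in for a separately deployed service.
class ServiceRegistry {
    private final Map<String, Supplier<String>> services = new LinkedHashMap<>();

    void register(String name, Supplier<String> handler) {
        services.put(name, handler);
    }

    // Calls one service. A failure is contained to that service and reported
    // as "DOWN" instead of propagating to the caller or to other services.
    String call(String name) {
        Supplier<String> handler = services.get(name);
        if (handler == null) {
            return "UNKNOWN";
        }
        try {
            return handler.get();
        } catch (RuntimeException e) {
            return "DOWN: " + e.getMessage();
        }
    }
}

class IsolationDemo {
    // Builds the three modules named in the talk; incident logging is broken on purpose.
    static ServiceRegistry build() {
        ServiceRegistry registry = new ServiceRegistry();
        registry.register("user-authentication", () -> "OK");
        registry.register("incident-logging", () -> {
            throw new IllegalStateException("log store unreachable");
        });
        registry.register("notification-dispatcher", () -> "OK");
        return registry;
    }

    public static void main(String[] args) {
        ServiceRegistry registry = build();
        for (String name : new String[] {
                "user-authentication", "incident-logging", "notification-dispatcher"}) {
            System.out.println(name + " -> " + registry.call(name));
        }
    }
}
```

In a real deployment the boundary would be a network call between separate processes, typically with timeouts and a circuit breaker, rather than a try/catch in one JVM.
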
Similarly, we can deploy updates independently without affecting other services. Whenever we want to scale up, let's say in traffic peak hours, we want to handle more users. At that point, we can scale up only the user authentication microservice and save cost by not scaling the others. This helps during high-traffic situations.

Coming to cloud computing, we have something called serverless computing, where we can achieve a similar thing: instead of focusing on infrastructure, we can just focus on code and business logic. This gives effortless scalability and reduces the overall operational cost. The best example is AWS Lambda. In the diagram we are seeing here, we have one AWS Lambda function that is triggered automatically to send a notification whenever an incident is created. It automatically scales resources based on real-time demand. It triggers only when there is an incident, which means it runs only in real time. Here we pay only for the computation time we actually use; we don't pay when the server is not used. We also don't have to worry about the number of servers used. This frees up developers to focus on incident management logic instead of the infrastructure.

Here we are going to discuss CI/CD pipelines, which enable continuous improvement for faster resolution. We have a couple of tools in the market right now: Jenkins, Git, and GitHub Actions. Here, whenever a new deployment happens,
our microservice is built, and the code is deployed automatically using Jenkins. Whenever a build is triggered, it deploys into a particular Docker container and K8s, and it runs. We can also have automated test cases, which improve the code quality: all the JUnit tests run, and end-to-end testing happens. The other example we can take here is the case of issues. Whenever there is an issue, the build won't be deployed to the server. We can stop the deployment, roll back easily in other environments using Jenkins, and check what was failing. So CI/CD pipelines help drive faster incident resolution.

Here we are going to discuss real-time monitoring and alerting. We have multiple tools to achieve this in current applications. Prometheus is one of the popular ones, and we have Grafana and Dynatrace, where we can identify if there are any CPU outages. There are a couple of scenarios where we can enable alerts. Here we are taking the example of CPU usage: whenever it exceeds 80%, let's say, we create an alert and send it to the application team. We can trigger an alert based on these parameters. We can also analyze the data to identify anomalies and potential issues, and we can easily see the number of spikes. We can also collect real-time data by monitoring CPU usage, memory usage, network traffic, and error rates. We can also monitor user logins.
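
The CPU threshold example above can be sketched as a small rule check in plain Java. This is an illustration under assumptions and not how Prometheus, Grafana, or Dynatrace implement alerting: the `ThresholdRule` class, the metric name, and the sample values are all made up here. The 80% threshold is the one mentioned in the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical alert rule: fires when a metric sample goes above a threshold.
final class ThresholdRule {
    private final String metric;
    private final double threshold;

    ThresholdRule(String metric, double threshold) {
        this.metric = metric;
        this.threshold = threshold;
    }

    // Returns one alert message per sample that is strictly above the threshold.
    List<String> evaluate(double[] samples) {
        List<String> alerts = new ArrayList<>();
        for (int i = 0; i < samples.length; i++) {
            if (samples[i] > threshold) {
                alerts.add(metric + " sample " + i + " = " + samples[i]
                        + " exceeds " + threshold);
            }
        }
        return alerts;
    }
}

class AlertDemo {
    public static void main(String[] args) {
        // Made-up CPU readings; only values above 80.0 should raise an alert.
        ThresholdRule cpuRule = new ThresholdRule("cpu_usage_percent", 80.0);
        double[] samples = {42.0, 79.5, 80.0, 91.3};
        for (String alert : cpuRule.evaluate(samples)) {
            System.out.println("ALERT for application team: " + alert);
        }
    }
}
```

With these samples only the last reading fires, because the comparison is strict and 80.0 itself does not exceed the threshold. The same rule could be pointed at memory usage, error rates, or login counts by changing the metric name and threshold.
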
Whenever there is a huge number of logins, let's say, we want to track it and we want alerts sent to the application team. We can do that with these real-time monitoring and alerting systems.

We just saw real-time monitoring and alerting. Next we want automated incident response. Whenever an incident is reported, we want to send it to the particular application team and have somebody address the issue. There are a couple of examples where we have P0 issues, like the database server being down or the vault being down for an application. For these kinds of issues, we need somebody to check manually. For that, we can integrate with PagerDuty so that the notification goes out. We can see here how incidents are managed and automatically integrated with PagerDuty, and we can escalate based on the severity of the issue. A couple of examples here are high CPU usage and an application being completely down, where we can trigger this.

Collaboration and communication. This is one of the main things: it enhances teamwork with centralized communication channels and collaboration. The main thing we want to discuss here is that we should have a shared platform for incident communication. We have Jira boards, and we can have group channels where we discuss incidents, like Slack channels and Microsoft Teams, where we can easily communicate about issues. Earlier we covered automated incident response and real-time monitoring; we want those alerts sent to these Slack channels, so we integrate them. Whenever there is a high-priority issue, we can send it directly to the particular application team and address it to all the stakeholders for that application.
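
The severity-based escalation and channel routing described above could look roughly like the following. This is a minimal sketch under assumptions: it does not call the PagerDuty or Slack APIs, and the `Severity` levels below P0, the `EscalationRouter` class, and the target name formats are hypothetical labels chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical severity levels; the talk only names P0 explicitly.
enum Severity { P0, P1, P2 }

final class EscalationRouter {
    // Returns the notification targets for an incident. Target names are
    // illustrative labels, not real PagerDuty or Slack identifiers.
    static List<String> route(String application, Severity severity) {
        List<String> targets = new ArrayList<>();
        // Every incident is posted to the owning team's shared channel.
        targets.add("chat:#" + application + "-incidents");
        if (severity == Severity.P0) {
            // P0 (for example, database or vault down) also pages on-call
            // and informs the application's stakeholders.
            targets.add("pager:" + application + "-oncall");
            targets.add("chat:#" + application + "-stakeholders");
        }
        return targets;
    }
}

class RoutingDemo {
    public static void main(String[] args) {
        System.out.println(EscalationRouter.route("checkout", Severity.P0));
        System.out.println(EscalationRouter.route("checkout", Severity.P2));
    }
}
```

Keeping the routing decision in one small function makes the escalation policy easy to review and test separately from whatever client actually delivers the page or chat message.
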
This helps with collaboration, and we can have centralized communication for all the teams so that we can address issues easily. We can also create a knowledge base for this using wiki pages or something similar, and we can have a weekly cadence for discussing these kinds of issues.

Here we are going to discuss data-driven insights. We can do detailed, comprehensive data analysis, and there are multiple ways to do it: we can use Tableau reports, Elasticsearch dashboards, and Kibana reports. Based on these, we can have monthly and weekly meetings where we discuss what happened when a production release went out. We can look at the Tableau reports and discuss how the traffic is behaving with the new release. Whenever a new feature is released, we can see how users are interacting with it. We can identify patterns proactively and prevent future incidents. Whenever there is an issue, we can look at the common root causes and identify them. The key metric to track here is mean time to resolution, which is called MTTR. We can look at the incidents and track their frequency and resolution time. So this is how we handle the data. Going on to the next.

Here we are going to discuss a case study done by IBM with one of the leading e-commerce applications. This e-commerce company was struggling with slow response times and frequent outages during the peak shopping season. They migrated to a microservice architecture on AWS, using Lambda for automated incident responses. Below are the highlighted results: they achieved a 78 percent scalability improvement.
They reduced MTTR by 50 percent, and they decreased incidents by 30 percent. Overall, scalability improved by 78%. This is a great achievement, I'd say, using both Java microservices and the cloud-based application. So this is the survey result.

And here are the key takeaways, guys. Cloud and Java enable efficient incident management. We discussed how both cloud and Java do this. We discussed microservices and serverless computing, and how scalability applies to both serverless Lambda and Spring Boot microservices. We saw that CI/CD and monitoring drive quicker responses. We also discussed moving beyond incident management, where we can create dashboards and take preventive measures based on historical data. Agility is key here, so that we respond to incidents with speed and efficiency, and we can take decisions based on data. Based on this data, we can see how to continuously improve: we can identify the issues and make sure they do not occur the next time. So these are the key takeaways, guys, on how we can handle incident management with both cloud and Java technologies. Thank you, everyone. Bye bye.

Arvind Akula

Staff Software Engineer @ Rate



