Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hello everyone.
My name is Arvind Kumar Akola.
I work as a staff software engineer at Guaranteed Rate.
Today my talk is about revolutionizing incident management with cloud
computing and Java technologies.
We will start with an introduction to cloud computing and Java technologies,
then discuss the challenges of traditional incident management
and see how the cloud helps address them.
We'll look at how the synergy between Java and cloud services, along with microservices,
serverless computing, and CI/CD integration, helps in incident management.
We'll also discuss real-time monitoring tools
and see how we can automate incident responses.
We'll discuss how we can collaborate and communicate better
with the teams when we have to handle an incident.
We'll also discuss some data-driven insights and a
case study done by IBM for an e-commerce application.
Introduction to cloud computing and Java technologies.
We all know that cloud computing has become the cornerstone of digital transformation.
The reason is that cloud computing offers major benefits such as
scalability, cost efficiency, and agility.
The global cloud computing market
is projected to reach around $832.1 billion by 2025.
So we can say that cloud computing has become one
of the core modern business strategies.
Many enterprises, such as Netflix, Amazon, eBay, and a lot
of other companies, use cloud computing. Along with that, Java is one of the
main application programming languages,
which is very reliable and has long been used in enterprise environments.
Java's main philosophy is platform independence, which is
"write once, run anywhere."
This fits perfectly with the distributed nature of cloud computing,
because it helps developers write portable
applications that run seamlessly across multiple cloud platforms.
The combination of this platform independence and the distributed
nature of cloud computing is a powerful synergy that enables
businesses to develop robust, scalable, and efficient applications.
Java also has an extensive ecosystem
with many frameworks.
The main frameworks are Spring Boot and Micronaut, with which we
can develop a production-ready application within a few days.
This makes Java an ideal technology for driving digital transformation.
That's about it.
Let's go ahead.
Here we are going to discuss the challenges of
traditional incident management.
Earlier, before cloud computing, microservices, and Java,
we were struggling with siloed systems and slow response times.
Based on a study from Gartner, downtime
costs around $5,600 per minute.
So in a holiday season when traffic peaks, whenever there
is an outage, it costs us the customers who would have used the application at that time.
We are going to miss that traffic.
That is the loss Gartner is pointing to here.
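As a rough illustration of the figure above, the sketch below multiplies the quoted per-minute cost by an outage length. The 30-minute outage and the class name `DowntimeCost` are hypothetical values chosen for the example, not numbers from the study.

```java
// Rough downtime-cost estimate using the per-minute figure quoted in the talk.
class DowntimeCost {
    // Figure attributed to the Gartner study in the talk (USD per minute).
    static final long COST_PER_MINUTE_USD = 5_600;

    // Returns the estimated cost in USD of an outage of the given length.
    static long estimate(long outageMinutes) {
        if (outageMinutes < 0) {
            throw new IllegalArgumentException("outageMinutes must be >= 0");
        }
        return outageMinutes * COST_PER_MINUTE_USD;
    }

    public static void main(String[] args) {
        // Hypothetical 30-minute outage during a peak period.
        System.out.println("Estimated cost: $" + estimate(30));
    }
}
```

Under these assumptions, a 30-minute outage works out to 30 x 5,600 = 168,000 dollars.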
Then there are the manual processes, which lead to human errors.
There are a lot of things we do manually: we search
logs, copy the request IDs,
update spreadsheets, and contact the team
members individually.
There is a high probability of
manual errors, which is why traditional incident
management is not up to the mark.
Siloed systems also hinder collaboration and visibility.
We had a lot of monitoring tools that were not integrated. We used to go directly to
Tomcat and other application servers
and check the errors there, and we didn't have an integration
between the logging mechanism and our communication channels.
So that was the issue.
Also, legacy infrastructure lacks the scalability to handle
modern demands.
Take the infrastructure we used to have earlier.
Let's say we have 10 nodes to serve the traffic. Whenever there is a holiday
season, we need to scale up, so we need to plan ahead and have a team
work on building the servers and deploying the application to make
it ready for that peak time.
All of this can be avoided with cloud computing.
We'll see that in the next slide.
So here we are going to discuss how the cloud gives us an advantage
with these problems, which we touched on just now.
What the cloud provides is agility,
scalability, and resilience.
On-demand resource scale-up and scale-down can
happen automatically in the cloud.
We can accelerate incident responses automatically, and whenever
a system or a database or something else is down, we can
add self-healing in the cloud.
We can also have a centralized platform that enhances collaboration,
with integrated monitoring, logging, and communication tools.
We have a lot of examples of this,
which we'll discuss in a minute, like Kibana, Grafana, and a lot
of other integration tools, where we can integrate all the Kubernetes (K8s)
clusters and the local instances
and get all the logs in one place.
Another important feature here is reduced infrastructure cost.
Whenever there is traffic, we increase the number of resources and
scale up, and whenever we don't have traffic, we can
free up those resources so that we decrease the maintenance cost.
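The scale-up and scale-down decision described above can be sketched as a small rule. This is only an illustration of the idea, not how any particular cloud provider implements autoscaling; the class name `AutoScaler`, the per-node capacity, and the node limits are all hypothetical.

```java
// Minimal sketch of a demand-based scaling rule (all constants are assumptions).
class AutoScaler {
    static final int REQUESTS_PER_NODE = 500; // hypothetical capacity of one node
    static final int MIN_NODES = 2;           // keep a small baseline for availability
    static final int MAX_NODES = 50;          // cap to bound cost

    // Returns how many nodes should be running for the given load.
    static int desiredNodes(int requestsPerSecond) {
        if (requestsPerSecond < 0) {
            throw new IllegalArgumentException("requestsPerSecond must be >= 0");
        }
        // Integer ceiling of requestsPerSecond / REQUESTS_PER_NODE.
        int needed = (requestsPerSecond + REQUESTS_PER_NODE - 1) / REQUESTS_PER_NODE;
        return Math.max(MIN_NODES, Math.min(MAX_NODES, needed));
    }

    public static void main(String[] args) {
        System.out.println("Quiet period: " + desiredNodes(300) + " nodes");
        System.out.println("Holiday peak: " + desiredNodes(5_000) + " nodes");
    }
}
```

With these assumed numbers, the node count follows the traffic up during a peak and drops back to the baseline when traffic falls, which is where the cost saving comes from.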
So here we are going to discuss how Java is
the powerhouse of cloud-native implementations for incident management.
Platform independence is the main feature: we can write
once and run anywhere, which helps with seamless deployments across
multiple cloud platforms.
On top of this we can build robust applications. We have a framework like
Spring Boot with which we can build microservices, and we have Spring
Cloud for service discovery.
We can do configuration management and load balancing using this framework,
and we have an extensive set of monitoring and logging libraries, along
with alerting, like Log4j, SLF4J, and Micrometer, which Spring Boot uses for metrics.
These libraries help us log application
errors or info messages
and can easily be added to a Java application. We also have a large,
active community supporting innovation, so new technologies
can be integrated into Java.
We'll discuss more about this now.
This is about microservices.
Earlier, when we were discussing the siloed environment,
we had a single monolithic application, which can be
divided into multiple smaller pieces of logic.
Let's take an example where we have an application with
a module for user authentication,
a module for incident logging, and a notification dispatcher.
In a
monolithic application architecture, whenever there is an issue with
user authentication, it is going to impact incident logging and
notification dispatching.
Similarly, whenever we have an issue with incident logging, it is going to
impact user authentication, because all of them are running in one single instance.
This can be avoided using microservices.
With microservices, we can isolate each module: incident logging,
user authentication, and notification dispatching each run as a separate service. Whenever there is an
issue with one service, say incident
logging, it is not going to impact
the user authentication microservice.
Similarly, we can deploy updates independently without
affecting other services.
Whenever we want to scale up, let's say in peak traffic hours we want to
handle a larger number of users.
At that point we can scale up only the user authentication
microservice to handle that load,
and we save cost by not scaling the other services.
So this helps during high-traffic situations.
Yeah.
Coming to cloud computing, we have something called serverless
computing, where we can achieve a similar kind of thing to
what we just discussed: instead
of focusing on infrastructure, we can just focus on code and business logic.
This gives us effortless scalability and reduces the overall
operational cost.
The best example of serverless computing
is AWS Lambda.
In the diagram we are seeing here,
we have one AWS Lambda function that is triggered automatically as an incident
response: a function that sends a notification whenever
an incident is created.
So this will send a notification.
The resources scale automatically based on real-time demand.
The function is triggered only when there is an incident,
so it runs only on demand.
Here we pay only for the computation time we actually use.
We don't pay when the function is not running.
We also don't have to worry about the number of servers used or
about managing them ourselves.
This frees up developers to focus on incident management logic
instead of focusing on the infrastructure.
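A notification function like the one described above can be sketched in plain Java. This is not AWS SDK code: `IncidentEvent`, `IncidentNotifier`, and the in-memory list of sent messages are hypothetical stand-ins. A real Lambda function would instead implement the `RequestHandler` interface from the AWS Lambda Java runtime library and call an actual notification service.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical event payload; a real trigger would deliver its own event type.
final class IncidentEvent {
    final String id;
    final String service;
    final String severity;

    IncidentEvent(String id, String service, String severity) {
        this.id = id;
        this.service = service;
        this.severity = severity;
    }
}

// Stateless-style handler: one invocation per incident-created event.
class IncidentNotifier implements Function<IncidentEvent, String> {
    // Stand-in for a real notification channel (email, chat, paging).
    private final List<String> sent = new ArrayList<>();

    @Override
    public String apply(IncidentEvent event) {
        String message = "[" + event.severity + "] Incident " + event.id
                + " created for " + event.service;
        sent.add(message); // a real function would call the notification service here
        return message;
    }

    List<String> sentMessages() {
        return Collections.unmodifiableList(sent);
    }

    public static void main(String[] args) {
        IncidentNotifier notifier = new IncidentNotifier();
        System.out.println(notifier.apply(new IncidentEvent("INC-1", "checkout", "P0")));
    }
}
```

The point of the sketch is that the code contains only the incident logic; nothing in it provisions or manages a server.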
Here we are going to discuss CI/CD pipelines, which give us continuous
improvement for faster resolution.
We have a couple of tools in the market right
now: Jenkins, Git, GitHub Actions.
Whenever a new change to
our microservice is created,
the code is deployed automatically
using Jenkins.
Whenever a build is triggered,
it is deployed into a Docker container on Kubernetes (K8s), and it runs there.
We can also have automated test cases,
which improve the code quality.
The pipeline runs all the JUnit tests, and end-to-end testing happens as well.
The other example we can take here is
the case of issues.
Whenever there is an issue, the rollback can happen easily
using Jenkins.
Whenever there is an issue, the build won't be deployed to the server.
We can stop the deployment, and we can also roll back
easily in other environments and check what was failing.
So CI/CD pipelines help us achieve faster incident
resolution.
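The two behaviours described above, blocking a deployment when tests fail and rolling back to the previous version, can be sketched as follows. This models only the logic; it is not Jenkins or GitHub Actions configuration, and `DeploymentGate` and the version strings are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal model of a deployment gate with rollback (illustrative only).
class DeploymentGate {
    // Deployed versions, most recent on top.
    private final Deque<String> history = new ArrayDeque<>();

    // Deploys only if the automated tests passed; returns true when the version went live.
    boolean deploy(String version, boolean testsPassed) {
        if (!testsPassed) {
            return false; // failing build never reaches the server
        }
        history.push(version);
        return true;
    }

    // Rolls back to the previous version if there is one; returns the live version afterwards.
    String rollback() {
        if (history.size() > 1) {
            history.pop();
        }
        return current();
    }

    // Currently live version, or null if nothing has been deployed yet.
    String current() {
        return history.peek();
    }

    public static void main(String[] args) {
        DeploymentGate gate = new DeploymentGate();
        gate.deploy("1.0.0", true);
        gate.deploy("1.1.0", false); // blocked by failing tests
        gate.deploy("1.2.0", true);
        System.out.println("Live after rollback: " + gate.rollback());
    }
}
```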
Here we are going to discuss real-time monitoring and alerting.
We have multiple tools to achieve this in current applications.
Prometheus is one of the popular ones,
and we have Grafana and Dynatrace, with which we can identify
things like CPU issues.
There are a couple of scenarios here where we can enable alerts.
Here we are taking the example of CPU usage: whenever it exceeds
80%, let's say, create an alert and send it to the application team.
So we can trigger an alert based on these
parameters. We can also analyze the data to identify anomalies and
potential issues, and we can easily see the number of spikes.
We can also collect real-time
data for monitoring CPU usage, memory usage,
network traffic, and error rates.
We can also monitor user logins.
Whenever there is a huge number of logins, let's say, we want to track
it and we want alerts to be sent to the application team.
We can do that with these real-time monitoring and alerting systems.
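The 80% CPU example above amounts to a simple threshold rule. In practice such a rule would be configured in the monitoring tool itself; the plain-Java sketch below only shows the logic, and `CpuAlertRule`, the host name, and the message format are hypothetical.

```java
import java.util.Optional;

// Threshold-based alert rule (logic only; not a Prometheus or Grafana API).
class CpuAlertRule {
    // Threshold used as the example in the talk.
    static final double CPU_THRESHOLD_PERCENT = 80.0;

    // Returns an alert message when usage exceeds the threshold, otherwise empty.
    static Optional<String> evaluate(String host, double cpuPercent) {
        if (cpuPercent > CPU_THRESHOLD_PERCENT) {
            return Optional.of("ALERT: CPU on " + host + " is at " + cpuPercent
                    + "% (threshold " + CPU_THRESHOLD_PERCENT + "%)");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical samples from one host.
        double[] samples = {42.0, 79.9, 92.5};
        for (double sample : samples) {
            evaluate("app-1", sample).ifPresent(System.out::println);
        }
    }
}
```

The same shape of rule applies to the other signals mentioned above, such as memory usage, error rates, or login counts, with a different threshold for each.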
We just saw real-time monitoring and alerting.
Next, we want an automated incident response.
Whenever an incident is reported, we want to send it to the right
application team, and we want somebody to address that issue.
There are a couple of examples where we have P0
issues, like a database server being down or the vault being down for an application.
For these kinds of issues, we need somebody to check manually.
For that, we can integrate with PagerDuty so that the notification reaches the on-call person.
We can see here how incidents are managed and automatically
integrated with PagerDuty,
and we can escalate based on the severity of the issues.
A couple of example triggers here are high CPU usage
and an application being completely down.
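Escalating by severity can be sketched as a routing function. Only P0 comes from the talk; the other severity levels, the channel names, and `IncidentRouter` are hypothetical, and a real setup would call the paging or chat tool's own integration instead of returning an enum.

```java
// Severity-based routing of incidents (illustrative mapping, not a PagerDuty API).
class IncidentRouter {
    enum Severity { P0, P1, P2 }

    enum Channel { PAGE_ON_CALL, TEAM_CHAT, TICKET_QUEUE }

    // Chooses where an incident should go based on how severe it is.
    static Channel route(Severity severity) {
        switch (severity) {
            case P0:
                return Channel.PAGE_ON_CALL; // e.g. database or vault down: wake someone up
            case P1:
                return Channel.TEAM_CHAT;    // visible to the team, no page
            default:
                return Channel.TICKET_QUEUE; // handled in normal working hours
        }
    }

    public static void main(String[] args) {
        for (Severity s : Severity.values()) {
            System.out.println(s + " -> " + route(s));
        }
    }
}
```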
Next is collaboration and communication.
This is one of the main things: teamwork
through centralized communication channels and collaboration tools.
The main point here is that we should have a shared platform
for incident communications,
like Jira boards, and we can also
have group channels
where we can discuss the incidents: Slack channels
and Microsoft Teams, where we can easily communicate about the issues.
Earlier we talked about automated incident responses and real-time
monitoring. We want to send those alerts to these Slack channels,
so we integrate them with Slack.
Whenever there is a high-priority
issue, we can send it directly to
the particular application team and address it to all the stakeholders
related to that application.
This helps in collaborating, and we have centralized
communication for all the teams
so that we can address the issues easily.
We can also create a knowledge base for this using
wiki pages or something similar,
and we can have a weekly cadence for
discussing these kinds of issues.
Here we are going to discuss data-driven insights.
We can do detailed, comprehensive data analysis,
and there are multiple ways to do it.
We can use Tableau reports,
Elasticsearch dashboards, and Kibana
reports.
Based on these, we can have monthly and weekly meetings where we
discuss what happened when a production release went out.
We can look at the Tableau reports and discuss
how the traffic is behaving with the new release.
Whenever a new feature is released, we can see
how the users are interacting with it.
We can identify patterns proactively and prevent
future incidents.
Whenever there is an issue, we can look at the common root
causes and identify them.
The key metric to track here is the mean time to resolution,
which is called MTTR.
We can look at the incidents and track their frequency
and the time to resolution for each.
So this is how we handle the data.
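MTTR is the total resolution time divided by the number of incidents. A minimal sketch of that calculation follows; the `Incident` type and the sample timestamps are hypothetical, and a real report would read them from the incident tracking tool.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

// Mean time to resolution over a list of resolved incidents.
class Mttr {
    // Hypothetical incident record: when it was opened and when it was resolved.
    static final class Incident {
        final Instant openedAt;
        final Instant resolvedAt;

        Incident(Instant openedAt, Instant resolvedAt) {
            this.openedAt = openedAt;
            this.resolvedAt = resolvedAt;
        }
    }

    // Average of (resolvedAt - openedAt), in whole seconds; zero for an empty list.
    static Duration meanTimeToResolution(List<Incident> incidents) {
        if (incidents.isEmpty()) {
            return Duration.ZERO;
        }
        long totalSeconds = 0;
        for (Incident incident : incidents) {
            totalSeconds += Duration.between(incident.openedAt, incident.resolvedAt).getSeconds();
        }
        return Duration.ofSeconds(totalSeconds / incidents.size());
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        List<Incident> incidents = Arrays.asList(
                new Incident(start, start.plusSeconds(30 * 60)),  // resolved in 30 minutes
                new Incident(start, start.plusSeconds(90 * 60))); // resolved in 90 minutes
        System.out.println("MTTR in minutes: " + meanTimeToResolution(incidents).toMinutes());
    }
}
```

For the two sample incidents, the mean is (30 + 90) / 2 = 60 minutes.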
Going on next.
Here we are going to discuss a case study done
by IBM with one of the leading e-commerce applications.
This e-commerce company was struggling with slow response
times and frequent outages during the
peak shopping season.
They migrated to a microservice architecture on AWS, using Lambda
for automated incident responses.
Below are the highlighted results they achieved: a 78
percent improvement in scalability,
a 50 percent reduction in MTTR, and a 30 percent decrease
in incidents.
Overall, the scalability improved by 78%.
This is a great achievement, I'd say, using both Java microservices and
a cloud-based application.
So this is the survey result.
So this is a great achievement.
And here are the key takeaways, guys.
Cloud and Java enable efficient incident management.
We discussed how both cloud and Java do this.
We already discussed microservices and serverless computing
and how scalability applies to both serverless
Lambda and Spring Boot microservices.
We saw that CI/CD, monitoring, and ML drive quicker responses.
We also discussed moving beyond incident management,
where we can create dashboards and take preventive measures
based on historical data.
Agility is key here, where we respond to incidents with speed and
efficiency, and we can also make decisions based on data.
Based on this data we can see how to improve continuously: we can identify
the issues and make sure they do not occur the next time.
So these are the key takeaways, guys, on how we can handle incident management
with both cloud and Java technologies.
Thank you, everyone. Bye-bye.