Conf42 Observability 2024 - Online

From Metrics Tsunami to Actionable Insights: Simplifying Database Troubleshooting

Abstract

We’ve all been there: drowning in a sea of database metrics, struggling to find a needle in the haystack during an investigation. This talk breaks free from the “more data, more graphs” mentality. We’ll explore how to move from information overload to actionable insights in database troubleshooting.

Summary

  • Divine Odazie explains how to go from a metrics tsunami to actionable insights to simplify database troubleshooting. He covers the limitations of legacy database monitoring tools and then how ClusterControl simplifies troubleshooting.
  • ClusterControl empowers organizations to standardize the full life cycle operations of their databases. Observability features include holistic dashboards, performance monitoring, operational reports, fault detection, query monitoring, load balancer monitoring and an activity center.
  • Reports can also be shared and emailed. All this gives you assurance that your database is performing well, with less cognitive load and less manual intervention, and you get better customer satisfaction, developer satisfaction and developer productivity.
  • ClusterControl can help you observe your cluster and simplify database troubleshooting. From the onset you get a single-pane-of-glass overview of all your clusters. Other dashboards give you a complete overview of your cluster.
  • Advisors can be edited to fit your needs. In the demo, a threshold variable set to 90 is changed to, say, two, then compiled and run to see what happens.
  • To get an alarm, the demo simulates a long-running query. You get a full overview of what actually happened, and the recommendation is to tune the query. The demo then shows how ClusterControl handles fault detection.
  • ClusterControl can help you simplify database troubleshooting and see what's happening in your cluster in real time. You can also perform actions immediately to keep your database performing well and highly available.
  • Two case studies of organizations that adopted ClusterControl mainly for observability and monitoring. If you're looking to adopt ClusterControl, you can reach out to the speaker.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, my name is Divine Odazie. Today I'm going to be talking about how you can go from a metrics tsunami to actionable insights to simplify your database troubleshooting. A little bit about me: I'm a technology evangelist at Severalnines, a company that empowers organizations to take control of their data in every environment, giving them a full-lifecycle DBaaS experience. I'm also a Certified Kubernetes Application Developer and a certified AWS Solutions Architect, and I'm very excited to speak at Conf42 Observability 2024.

Here's the agenda for my talk. We're going to start by discussing the developer dilemma and the limitations of legacy database monitoring tools. Then we'll talk about how you can simplify database troubleshooting with ClusterControl, and look at case studies of organizations that are using ClusterControl to monitor their databases and simplify troubleshooting.

So, is this you? Are you drowning in database metrics? Do you feel not empowered but overwhelmed? When issues arise, do you have to navigate through multiple tools for database observability? This is because of the limitations of legacy database monitoring tools.

The first major limitation is graph overload. Yes, these tools have been able to capture a lot of metrics and data from system operations and database operations, and they've also been able to visualize it in beautiful graphs. But developers control much of the environment today, and most developers don't understand most of the metrics they are being given. They just want simplified, actionable insights: the metrics they really need to solve their problem at that time.

Another limitation is limited database support. You use a tool to monitor, say, your Postgres database. But most organizations don't use just one database.
So you need another tool for MySQL. You use Redis, so you need another tool for Redis, or for any of the many other databases; there are a lot of databases around for specific problems, and most organizations use diverse database environments. So you're stuck with multiple tools, which brings cognitive overload and reduces developer productivity.

Another major limitation is visibility without action. Yes, monitoring is the default. But beyond monitoring, beyond observability, you need to be able to take actions and resolve issues. Most of these legacy tools just offer the monitoring, which means you have to either intervene manually or use a different tool to act. That again brings cognitive overload. So if you're tired of the limitations of these legacy tools, you can try ClusterControl, and that's what we're going to look at in this talk.

Next: simplifying database troubleshooting with ClusterControl. ClusterControl empowers organizations to standardize the full life cycle operations of their databases. It's self-hosted, offers a community version that is free forever, and supports most open source and proprietary databases. You remember what I said about having diverse database environments? With ClusterControl you'll be able to monitor Redis, Postgres, MySQL, Galera and several other databases, and also proprietary databases like Microsoft SQL Server. You also get centralized monitoring: you're able to see everything in your environment. And it supports various tooling: you can use Terraform to orchestrate ClusterControl, and use Ansible and Prometheus.

So, observability with ClusterControl.
Just like I mentioned on the previous slide: holistic dashboards, performance monitoring, operational reports, fault detection, query monitoring, load balancer monitoring and the activity center. Let's look at each of them in more depth.

First, holistic dashboards. With ClusterControl you can view your database cluster even if it's in a multi-cloud environment. You can see metrics from your database, and not just metrics but the metrics you need; not just graphs but the graphs you need, important and simplified. You'll be able to track queries and statements, and you also get server-level metrics: network, input and output operations, and the uptime of the server that hosts your database.

Then load balancer monitoring. This is a feature that many monitoring tools don't offer. They give you monitoring and observability on your database nodes. But today you rarely see just one database instance; you see multiple database instances, so you use a load balancer, and you don't know what's happening in your load balancer. Load balancer monitoring is essential to see traffic distribution, response time, throughput, latency and all the other metrics and statistics you need to keep your database performing well and to enhance uptime.

Next, performance monitoring. This is a key feature that only some of these monitoring tools offer. In ClusterControl, advisors give you advice containing a status and a justification. Something happens and an advice is sent to you: this is what happened, this is what you should do, and this is why you should do it. There's also DB growth, where you can see the growth of your databases and tables over time.
On a daily basis you can see whether one database is growing more than another, so you can take action knowing what is happening, because the entire idea of observability is to know what's happening in your environment. There are also database variables, to know the configuration of your database, and transaction deadlocks: databases handle transactions and deadlocks do happen, so you can view them.

Then query monitoring. This is essential because queries are how we interact with databases. With query monitoring in ClusterControl you get to see top queries ordered by occurrence or execution time, so you know which queries are slowest and which are most common, depending on how you configure it. Some queries also run long. It's possible for a query not to be well written or not to be optimized; it happens more often than not. So you get to see which queries are outliers, taking longer than normal to run, and you get a notification and an advisor on that. There are also database connections, where you see the processes and connections of your database.

Then monitoring agents. You can install monitoring agents to collect these metrics and this data. ClusterControl uses Prometheus, which gives you the ability to customize. Let's say you want to build another dashboard for a particular metric that's not covered, for a particular need: you can plug in Grafana and build the dashboard on that.

Then fault detection and response. I think you get the idea already: in real time you get notified of what's happening, through an advisor or an alarm, with what you need to do and an actionable recommendation for your database. You can see the alarm details in this screenshot. And ClusterControl also handles recovery.
I talked about the limitations of legacy monitoring tools. This is a key feature in the sense that ClusterControl handles the recovery for you. With legacy monitoring tools you can see that a node is down, that a database instance is down, and then you have to go and bring it back, restart the node or rebuild the node. Or let's say there is replication and a replica node is broken: you have to go and fix that manually. With ClusterControl there's automatic failover, and you can rebuild your node directly from the ClusterControl dashboard. It's also self-healing, in the sense that if you have a cluster with several nodes, they fail over automatically, and the cluster can be healed if, say, there were dependency mishaps during installation or upgrades. ClusterControl can resolve those kinds of issues to make sure your database is always up.

If you choose to intervene manually, you can schedule maintenance and make changes by hand. Recovery can of course also go through backups, whatever the situation may be. And there's scaling: you can scale your database right from the ClusterControl dashboard. So you don't need to bring in another tool or, as a developer, go to, let's say, a DBA in your team.

Then the activity center. You see all alarms, you see all logs and you get all notifications. You can integrate your notifications with Slack, with Opsgenie, or with any incident management tool you use.
Let's say you perform a recovery with ClusterControl, or you perform an action such as rebuilding a node: you can see all those actions. And with logs, you can see audit logs too, to know who performed an action, because you can create different users with access to your ClusterControl environment, and they are able to make changes. All of this gives you a virtual DBA experience, which lets you deploy faster, move faster and resolve issues faster.

Then operational reports. This is very important. Let's say you're back from a long weekend and you want to get started working. You can easily create an operational report: a database availability report, a system report or a backup report. I will do a demo and show you practically how you can create and view one. You see how ClusterControl shows the state of your database over time and, if something occurred, which recovery actions it recommends. There are different types of reports: incident reports, error reports, backup reports, availability reports. That gives you assurance when you're starting your day: you know what happened over the period, say over the weekend. These reports can also be shared and emailed.

All this gives you assurance that your databases are up and performing well, and confidence that you're giving your customers the best service possible with less time, less cognitive load, less manual intervention and less cost.
And you get better customer satisfaction, and also developer satisfaction and developer productivity.

Now we'll go into a live demo and see how ClusterControl can help you observe your cluster and simplify database troubleshooting. I have a ClusterControl environment here. From the onset you get this cluster status overview, where you see all the nodes. I have two database clusters here, a PostgreSQL cluster and a MySQL Galera cluster. I can see that these two clusters have 17 nodes. I can see that this is a Galera node, this is a primary node, this is the Prometheus node, and the next node is the load balancer. So from the onset you get a single-pane-of-glass overview of all the clusters that ClusterControl manages. You can also choose to deploy a new cluster or import a cluster; for this environment I've imported these two clusters. I will go into them individually and show you the features I talked about before this demo.

Let's go into the Galera cluster. I click, and I'm in the Galera cluster. I can see the system overview. What operating system is this cluster running on? It's running on Rocky Linux. I can see the four cores. I can look at each Galera node: this Galera node has two cores and approximately four gigs of memory, and it's been up and running for three days. I can see the CPU usage, the idle times and the network usage; here there was more network usage. I can see the disk input and output. I can also go into each of these graphs and zoom in. And when I zoom in, I don't know if you noticed, all the other graphs
zoom in with respect to this view, this timeline. I can also filter these metrics over time. I set the date range: if I want to see the metrics from 15 minutes ago, I can select that. Or I want to see the metrics for the last seven days. But this cluster has only been up for three days, so you can see it starts just on the 27th. So you get an overview of what is necessary. Then you can go in and see the same for the load balancer.

Aside from the system overview, you can also look at the MySQL server specific data. You can see the queries being run and the connection usage. This is also set to the last seven days, so I can change it to one hour. And there are other dashboards I can look at that give me a complete overview of my cluster.

I can go into the nodes and see all the nodes in the cluster: the Prometheus node, the ProxySQL node. I can click on the ProxySQL node and get an overview of the requests, just like I said about load balancer monitoring. You can see that only one of the three nodes is selected to receive writes, while all nodes receive reads, so the writer node is both writer and reader. Then you can see that in this host group there are these two nodes: they are not active writers, they are backup nodes.

Then, for performance monitoring, you can look at DB growth. These are test tables in a test database. You can see the largest database, the table count, rows, data size and index size. You can also see the DB status, the DB variables and the connections of the databases.
Then you can also look at the query monitor, like we talked about. This is dummy test data. You can see the top queries, sorted by total execution time, with the query counts and the respective databases these queries are sent to. You can also update the settings.

You can also go to the advisors section. Advisors are one of the key features I talked about, and you see a lot of advisors here. Green obviously means that everything is good. Then you see yellow, which is a warning status. So we can see this warning: table access without using index. We can see the specific instance, the specific database node, that this warning is about. And we can see the justification, everything you need to understand it: this table has been queried without using an index. You see the database, with a total of 7,000 I/O operations, and it gives you the exact number of reads and the exact number of deletes. You can see this for several other types of advisors, say, check DB accounts without a password. Or this one, excessive CPU usage: it shows you whether there is excessive CPU usage, which is something you would need to look at.

The good thing is you can also customize these advisors. If I go here to manage, I can go to the script and make changes to the CPU usage advisor. This is written in the ClusterControl domain-specific language. It's similar to JavaScript, so developers with development experience can easily understand it and make changes. So now I see that this is a variable with a threshold of 90.
So I can change this to, let's say, two, then compile and run and see what happens. From the previous result I saw that CPU usage was okay. But now that the threshold is set lower, I get the message that there's excessive CPU usage, that CPU usage has been high. Let me edit this back to 90, and I can see that there's a new status saying CPU usage is okay. So depending on your needs, depending on what the right threshold is for you, you can edit advisors and edit the configurations for your clusters and for your load balancers, like you can see here: MaxScale, your different databases, disk space usage. You can edit things to fit your needs.

Now let's look at alarms. I go to the alarms tab and see that there's no alarm here. So in order for us to get an alarm, let's simulate a long-running query. From ClusterControl I can SSH directly into the nodes, so I don't have to go to my server provider or cloud provider. I SSH into this primary node, log into MySQL and enter the password. Now let me run this test sleep query. I go back to the ClusterControl dashboard and wait for a few seconds, and the alarm pops up. The title: long-running query. The severity: warning. The category: DB performance. I can look more into the long-running query and see the exact query that I ran. So you get a full overview of what actually happened.
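The advisor threshold edit from the demo can be illustrated with a minimal Python sketch. This is not the ClusterControl domain-specific language, whose syntax the talk does not show; the function name `cpu_usage_advisor`, the `AdvisorResult` type and the sample CPU reading of 12% are assumptions made up for illustration. It only shows why lowering a threshold from 90 to 2 flips the same reading from an OK status to a warning, each with a justification.

```python
from dataclasses import dataclass


@dataclass
class AdvisorResult:
    """Hypothetical advisor output: a status plus a human-readable justification."""
    status: str          # "OK" or "WARNING"
    justification: str


def cpu_usage_advisor(cpu_usage_percent: float, threshold: float = 90.0) -> AdvisorResult:
    """Compare one CPU usage reading against a configurable threshold (assumed logic)."""
    if cpu_usage_percent > threshold:
        return AdvisorResult(
            status="WARNING",
            justification=(
                f"CPU usage {cpu_usage_percent:.1f}% is above "
                f"the {threshold:.1f}% threshold"
            ),
        )
    return AdvisorResult(
        status="OK",
        justification=(
            f"CPU usage {cpu_usage_percent:.1f}% is within "
            f"the {threshold:.1f}% threshold"
        ),
    )


if __name__ == "__main__":
    reading = 12.0  # made-up sample reading, in percent
    print(cpu_usage_advisor(reading))                 # default threshold of 90
    print(cpu_usage_advisor(reading, threshold=2.0))  # lowered threshold, as in the demo
```

The reading never changes between the two calls; only the threshold does, which is why a threshold should reflect what counts as abnormal for a given workload.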
The recommendation is to tune the query. This is just a sleep; after 6 seconds it stops. I can choose to ignore this query and the notification goes away. So I've muted it, because this is a demo environment and I don't want to worry about it.

Now let me show you how ClusterControl handles fault detection. I'll go into the PostgreSQL cluster. There's no alarm. I go to the nodes, and you see there's one primary and one replica. In my ClusterControl, for this cluster, I have auto recovery turned on: cluster auto recovery and node auto recovery. I can choose to turn them off, but I want them on for this. Now let me stop this node, the primary node. I stop the node and I can see that the job gets created. I can go and see that the new stop-node job is running, and that auto recovery is set to true. I go back to the nodes: I've shut down the node, and I get an alarm, cluster failure, critical, cluster recovery is needed. After a few minutes you see that my replica node became the primary. So it happens fast.

Let's go look at the logs. We see "recovering the master as all hosts are down", so ClusterControl automatically recovers the node and starts it again. Here you see my cluster recovering from the cluster failure. The cluster failure alarm is gone and the cluster is back up and running. The initial replica node became the primary, and the primary became the replica.
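The role swap seen in this part of the demo can be sketched in a few lines of Python. ClusterControl's actual recovery logic is not described in the talk, so everything here is an assumption for illustration: the `Node` type, the `fail_over` function and the node names `pg1` and `pg2` are hypothetical, and a real failover also has to deal with replication lag, fencing and reconfiguring the old primary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical view of a database node as a monitor might track it."""
    name: str
    role: str      # "primary" or "replica"
    online: bool


def fail_over(nodes: List[Node]) -> Optional[Node]:
    """If the primary is offline, promote the first online replica.

    Returns the newly promoted node, or None when no change was made
    (the primary is healthy, or no online replica is available).
    """
    primary = next((n for n in nodes if n.role == "primary"), None)
    if primary is not None and primary.online:
        return None  # nothing to do

    candidate = next((n for n in nodes if n.role == "replica" and n.online), None)
    if candidate is None:
        return None  # no healthy replica to promote

    if primary is not None:
        primary.role = "replica"  # the old primary rejoins as a replica once recovered
    candidate.role = "primary"
    return candidate


if __name__ == "__main__":
    cluster = [Node("pg1", "primary", online=False), Node("pg2", "replica", online=True)]
    promoted = fail_over(cluster)
    print(promoted)
    print(cluster)
```

Calling `fail_over` again on the same list returns `None`, because the promoted node is now an online primary.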
Now we've covered most of the parts: fault detection, query monitoring, performance monitoring, the dashboards that ClusterControl provides, ClusterControl's recovery features and how you can schedule maintenance. So now let's look at the activity center. While doing all this we looked at the alarms, logs and jobs tabs; those make up the activity center. So we can go to the activity center. Currently there's no alarm, because there's no issue right now. If we simulate one again, like we saw, another alarm will pop up. We can also view all the jobs for each cluster. Let me zoom in a bit: you can see the title of each job. I stopped the node, there was the node recovery, and there are backups that were previously created.

We can also look at audit logs, which show who did what. A ClusterControl environment is not managed by only one person, so you see which user did what. I've only used one user here, so you only see CC admin. But you can always create users. In ClusterControl I can create a new user, create a team and assign the user to the team. I can also implement LDAP if the organization already uses Active Directory, so you don't need new credentials to access ClusterControl. It slots into your current security regulations and compliance: you can set your LDAP settings and map groups to each team.

But back to the activity center: that's what it gives you, a way to see what is happening in the cluster and to perform actions. And then operational reports. I've already created some operational reports.
So let me create another one to show you how ClusterControl can help you see what happens with your cluster. Let me select the Postgres cluster and choose the database availability report. I can put in an email for the recipient and set the range to three days. I create this report, and within a few seconds it is created. Then I can view it, and it opens in a new page. What's the availability of my Postgres cluster? 99.98%. And was there any downtime? There was no downtime. You remember I simulated a node going down. But you see there were zero minutes of downtime because of ClusterControl's auto recovery feature.

You can also create other types of operational reports: schema changes, database growth, capacity reports, and standard system reports that give you an overview of your system, like CPU usage. Let's look at backup reports. Let me create a backup report for the Galera cluster. I create it and I can view it. You see that the backup was completed, and you see a detailed overview of all the backups: backup size and storage location. I store this backup on the server, on the node, but you can also store your backups on AWS and other clouds. You also see the location of the backups and whether the backup is encrypted. You can create verifiable backups and see the backup rotation in days. This gives you an understanding of what's happening with your cluster: what happened before and what's happening now.
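The availability figure in the report can be related to downtime with simple arithmetic. How the report derives its numbers is not stated in the talk, so the sketch below only uses the generic definition of availability as uptime divided by the reporting period; the function names are made up for illustration.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Availability as the share of the period that was not downtime, in percent."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def downtime_budget_minutes(period_minutes: float, availability_pct: float) -> float:
    """Downtime implied by a given availability percentage over a period."""
    return period_minutes * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    three_days = 3 * 24 * 60  # the three-day report range from the demo, in minutes
    print(availability_percent(three_days, downtime_minutes=0))
    print(downtime_budget_minutes(three_days, availability_pct=99.98))
```

Under this definition, 99.98% over three days corresponds to a little under one minute, which helps explain how a report can round downtime to zero minutes while showing an availability just below 100%.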
And that gives you the satisfaction that your cluster is up now and performing well. If anything happens, you can easily react. The reaction is automated, and you can obviously react manually if need be. You saw I was able to SSH into a node directly from the ClusterControl environment. That is pretty helpful: in an emergency I can perform actions right away. So now you've seen how ClusterControl can help you simplify database troubleshooting, see what's happening in your cluster in real time, and perform actions immediately, so that you have a well-performing and highly available database.

Now we're going to look at who uses ClusterControl for observability. ClusterControl is trusted by a lot of organizations; some you know, some you don't. I'm going to look at two case studies of organizations that adopted ClusterControl mainly for observability and monitoring.

The first is Westpay, a Swedish fintech company. They had a challenge while growing in the market: they were not able to know what was going on in their cluster. They spent a lot of time tweaking and making changes manually, and manual customization can lead to key-person dependence. Let's say the person who made those tweaks moves on from the company, or is sick at a particular time. Key-person dependence is something every organization tries to avoid, and should avoid. When Westpay adopted ClusterControl, and this is the feedback from Westpay's CTO, they were able to ensure the highest possible uptime.
Because they knew what was happening in their database: they knew when long queries were occurring, they knew when the database was growing exponentially and what to do, and they knew when a node was down and there was auto recovery. You can read more about the case study by scanning this QR code.

We also have another example with HolidayPirates. Managing and scaling their database was a challenge for them, and they needed a solution that gives them a holistic view, a single pane of glass, a full understanding of what's happening and the ability to take action. Now they are notified of the status of their database and can take corrective and preventive measures. Like you saw in the demo, in ClusterControl some notifications and alarms are warnings, and you have red, which obviously means a failure, a very critical alarm. We didn't simulate that in the demo. And you have green for an OK status. You can get these notifications on Slack, or integrate these alarms and notifications with any incident management tool you use. You can read more about this case study by scanning this QR code.

That is the end of my talk. Thank you very much for listening; I hope this talk was interesting and insightful. If you're looking to adopt ClusterControl, you can easily reach out to me: in the talk description you can see my socials, on LinkedIn and also on Twitter. Also, at Severalnines we host the Sovereign DBaaS podcast. Sovereign DBaaS, like I mentioned earlier, is about enabling companies to take control of their data. It's a community of individuals from different companies that prioritize data sovereignty, and not just data sovereignty on the basis of where your data is located.
But also control over the entire data stack: your infrastructure and the tooling around your database. There are many episodes, including a recent one on using an open source cloud. You can check it out on Spotify and on YouTube at Severalnines. You can always reach out to me; I'm always active on social media. I hope this was a great talk and interesting for you to listen to. Have a great rest of Conf42 Observability.
...

Divine Odazie

Technology Evangelist @ Severalnines



