Transcript
This transcript was autogenerated. To make changes, submit a PR.
Hey everyone, my name is Divine Odazie. Today I'm going to be
talking about how you can go from metric tsunami to actionable insights
to simplify your database troubleshooting.
So, a little bit about me: I'm a technology evangelist at Severalnines,
a company that empowers organizations to take control of their data in
every environment, giving them a full-lifecycle DBaaS experience. I'm
also a Certified Kubernetes Application Developer and a certified AWS
Solutions Architect, and I'm very much excited to speak at Conf42
Observability 2024.

So here's the agenda for my talk. We're going to start by discussing
the developer dilemma and the limitations of legacy database
monitoring tools. Then we'll talk about how you can simplify database
troubleshooting with ClusterControl, and look at case studies of
organizations that are using ClusterControl to monitor their databases
and simplify their database troubleshooting.

So, is this you? Are you drowning in database metrics? Do you feel not
empowered but overwhelmed? When issues arise, do you have to navigate
through multiple tools for database observability? This is because of
the limitations of legacy database monitoring tools.

The first major limitation is graph overload. Yes, these tools have
been able to capture a lot of metrics and data from system operations
and database operations. They've also been able to visualize it in
beautiful graphs.
But then again, developers control much of the environment today, and
most developers don't understand most of the metrics they are being
given. They just want simplified, actionable insights: actionable
metrics, what they really need to solve their problem at that time.

The next limitation is limited DB support. Say you use a tool to
monitor your Postgres database. But then again, most organizations
don't use just one database. So you need another tool for MySQL; you
use Redis, so you need another tool for Redis; or you use any of the
many other databases, because there are a lot of databases around for
specific problems.
And most organizations use diverse database environments. So you're
stuck using multiple tools, which brings about cognitive overload and
reduces developer productivity.

Another major limitation is visibility without action. Yes, visibility
is the default for monitoring. But beyond monitoring, beyond
observability, you need to be able to take actions and resolve issues.
Most of these legacy tools just offer the monitoring, which means you
have to either intervene manually or use a different tool to act. That
brings about cognitive overload again. So if you're tired of the
limitations of these legacy tools, you can try ClusterControl, and
that's what we're going to look at in this talk.

So next: simplifying database troubleshooting with ClusterControl.
ClusterControl empowers organizations to standardize the full-lifecycle
operations of their databases. It's self-hosted, offers a community
version that is free forever, and supports most open source and
proprietary databases.
You remember what I said about having diverse database environments?
With ClusterControl, you'll be able to monitor Redis, Postgres, MySQL,
Galera, and several other databases, and also proprietary databases
like MS SQL Server. You also get centralized monitoring: you're able
to see everything in your environment. And it supports various
tooling: you can use Terraform to orchestrate ClusterControl, and use
Ansible and similar tools.

So, observability with ClusterControl. Just like I talked about in the
previous slide, you get holistic dashboards, performance monitoring,
operational reports, fault detection, query monitoring, load balancer
monitoring, and the activity center. Let's look at each of them in
more depth.

First, holistic dashboards. With ClusterControl, you can view your
database cluster, even if it's in a multi-cloud environment. You can
see metrics from your database, and not just metrics, but the metrics
you need; not just graphs, but the graphs you need, graphs that are
important and simplified. You'll be able to track queries and
statements, and it also gives you server-level metrics: network, input
and output operations, and uptime of the server that hosts your
database.

Then load balancer monitoring. This is a feature that many monitoring
tools don't offer. They give you monitoring and observability on your
database nodes. But today you rarely see just one database instance;
you see multiple database instances. You use a load balancer, and how
will you know what's happening in your load balancer? So load balancer
monitoring is essential for seeing traffic distribution, response
time, throughput, latency, and all of the necessary metrics and
statistics to keep your database well performing and with enhanced
uptime.

Performance monitoring. This is a key feature that some of these
monitoring tools offer. First, advisors: on ClusterControl, advisors
give you advice containing a status and a justification. So when
something happens, an advice is sent to you saying: this is what
happened, this is what you should do, and this is why you should do
it. Then there's DB growth, to be able to see the growth of your
databases and tables over time. On a daily basis you can see whether
one database is growing more than another, in order to take action
knowing what is happening. Because the entire idea of observability is
to know what's happening in your environment. There are also database
variables, so you know the configuration of your database, and
transaction deadlocks: databases handle transactions, and deadlocks do
happen, so you're able to view those. Then query monitoring. This is
very important and very essential, because queries are how we interact
with databases.
For query monitoring with ClusterControl, you get to see top queries,
ordered by occurrence or execution time, so you know which queries are
slowest and which queries are most common, depending on how you want
to configure it. Also, some queries run long. It's possible for a
query not to be well written, which happens more often than not, or
not to be optimized. So you get to see query outliers, the queries
that are taking longer than normal to run, and you get an advisor on
that. Then database connections: you get to know the processes and
connections of your database. And monitoring agents: you can install
monitoring agents in order to collect these metrics and this data.
ClusterControl uses Prometheus, which gives you the ability to
customize. Let's say you want to build another dashboard for a
particular metric that's not covered, or for a particular need. You
can plug in Grafana and build out the dashboard on that.

So, fault detection and response. I think you get the idea already. In
real time, you get notifications of what's happening. This arrives as
an advisor or an alarm that tells you what you need to do, with
actionable recommendations and analysis for your database. You can see
alarm details in this screenshot. And ClusterControl also handles
recovery. You know I talked about the limitations of legacy monitoring
tools.
So this is a key feature, in the sense that ClusterControl handles the
recovery for you. With the legacy monitoring tools you can see: okay,
this node is down, this database instance is down. And then you have
to go and bring it back, or you have to go and restart the node or
rebuild the node. Or let's say there is replication and the replica
node is broken. When that breaks, you have to go and manually fix
that. With ClusterControl, there's automatic failover, and you have
the ability to rebuild your node directly from the ClusterControl
dashboard. It's also self-healing, in the sense that if you have a
cluster with several nodes, they automatically fail over, and you have
the ability to heal your cluster. Say there were dependency mishaps
during installation or during upgrades: ClusterControl can resolve
those kinds of issues in order to make sure your database is always up.
And if you choose to intervene manually or make changes manually, you
can also schedule maintenance and make those changes yourself. And
recovery, of course, may be through backups, or however the situation
may be. There's also scaling: you can scale your database just from
the ClusterControl dashboard, so you don't need to bring in another
tool or, as a developer, have to go to a DBA, if you have one on your
team.

Then the activity center. You see all alarms, you see all logs, you
get all notifications. You can integrate your notifications with
Slack, you can integrate with Opsgenie, you can integrate with any
incident management tool you use. Let's say you perform a recovery
with ClusterControl, or you perform an action, or you rebuild a node:
you can see all of those actions. And with logs, you can see audit
logs too, to know who performed an action, because you can create
different users in your ClusterControl environment who have access and
are able to make changes. It gives you a virtual DBA experience, which
lets you deploy faster, move faster, and resolve issues faster.

Then operational reports. This is very important. Let's say you go
away over the weekend, let's say a long weekend, and you're back and
you want to get started working. You can easily create an operational
report based on database availability, a system report, or a backup
report. I will do a demo and show you this practically, how you can
create, design, and view it. You see how ClusterControl shows the
state of your database over time and, if something occurred, which
recovery actions are recommended. There are different types of
reports: incident reports, error reports, backup reports, availability
reports. And that gives you assurance: you're starting your day and
you know what happened over time, as in the example of the weekend.
These reports can also be shared; they can be emailed. All this gives
you the assurance that your databases are up and well performing. You
can be confident that you're giving your customers the best service
possible with less time, less cognitive load, less manual
intervention, and less cost. And you get better customer satisfaction,
and also developer satisfaction and developer productivity.

So now we'll go into a live demo and see how ClusterControl can help
you observe your cluster and simplify database troubleshooting. I have
a ClusterControl environment here. From the onset you get this cluster
status overview, where you see all the nodes. I have two database
clusters here, a PostgreSQL cluster and a MySQL Galera cluster. I can
see that these two clusters have 17 nodes. I can see: okay, this is a
Galera node, this is the primary node. I can also see this is the
Prometheus node, and the next node is the load balancer. So from the
onset you get that full single-pane-of-glass overview of all the
clusters that ClusterControl manages. You can also choose to deploy a
new cluster or import a cluster. For this environment I've imported
these two clusters. I will go into these clusters individually and
show you all the features I talked about before this demo. So let's go
into the Galera cluster.
So I click, and I'm in the Galera cluster. I can see the system
overview. What operating system is this cluster running on? It's
running on Rocky Linux. I can see the four cores. I can look at each
Galera node and see that this Galera node has two cores and
approximately four gigs of RAM, and it's been up and running for three
days. I can see the CPU usage, the idle time, and the network usage;
okay, there was more network usage here. I can see the disk input and
output. I can also go into each of these graphs and zoom in. And when
I zoom in, I don't know if you noticed, all the other graphs zoom in
with respect to this timeline. I can also filter these metrics over
time by setting the date range. Say I want to see the metrics from 15
minutes ago: I can select that. Or I want to see the metrics for the
last seven days. But this cluster has only been up for three days, so
you can see the data starts on the 27th. So you get an overview of
what is necessary.
Then you can go in and see the same for the load balancer. And aside
from the system overview, you can also look at the MySQL server
specific data. You can see the queries being run and the connection
usage. This is also set to the last seven days, so I can put it to one
hour. And there are other dashboards I can look at that give me a
complete overview of my cluster. I can go into the nodes and see all
the nodes in the cluster: I can see the Prometheus node and the
ProxySQL node.
Then I can click on it and get an overview of the requests, just like
I talked about with load balancer monitoring. You can see that only
one of the three nodes is selected to receive writes, while all nodes
receive reads, including the writer node, which is both writer and
reader. Then you can see these two nodes: they are not active writers,
they are backup nodes.

Then, for performance monitoring, you can look at the DB growth. These
are tables in a test database. You can see the largest databases, and
you can see the table count, rows, data size, and index size. You can
also look at the DB status, the DB variables, and the connections of
the databases.
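
To make the DB growth idea concrete, here is a minimal Python sketch of
comparing how fast databases grow from periodic size samples. It is not
how ClusterControl computes its DB growth view; the function names, the
database names, and the sample sizes are all hypothetical.

```python
from datetime import date


def daily_growth(samples):
    """Average growth in bytes per day for each database.

    `samples` maps a database name to a list of (day, size_in_bytes) pairs.
    """
    growth = {}
    for db, points in samples.items():
        points = sorted(points)  # oldest sample first
        first_day, first_size = points[0]
        last_day, last_size = points[-1]
        days = (last_day - first_day).days
        growth[db] = (last_size - first_size) / days if days else 0.0
    return growth


def fastest_growing(samples):
    """Name of the database with the highest average daily growth."""
    growth = daily_growth(samples)
    return max(growth, key=growth.get)


# Hypothetical size samples taken three days apart (not from the demo).
samples = {
    "orders": [(date(2024, 1, 1), 1_000_000), (date(2024, 1, 4), 1_600_000)],
    "sessions": [(date(2024, 1, 1), 5_000_000), (date(2024, 1, 4), 5_300_000)],
}

print(daily_growth(samples))
print(fastest_growing(samples))
```

Note that the larger database is not necessarily the one growing
fastest, which is why a growth view is useful next to a plain size view.
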
Then you can also look at the query monitor. The query monitor, like
we talked about, evaluates your queries; take this test data. You can
see the top queries, sorted by total execution time. You see the
execution time, the query counts, and the respective databases that
these queries are sent to.
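
The "top queries" view described here can be pictured as a simple
aggregation. The Python sketch below groups a query log by query text
and database, then sorts by total execution time or by occurrence. It
only illustrates the idea; the log format, the function name, and the
sample entries are assumptions, not ClusterControl's implementation.

```python
from collections import defaultdict


def top_queries(log, sort_by="total_time", limit=3):
    """Aggregate (query, db, seconds) entries and return the top rows.

    `sort_by` is either "total_time" or "count".
    """
    stats = defaultdict(lambda: {"count": 0, "total_time": 0.0})
    for query, db, seconds in log:
        entry = stats[(query, db)]
        entry["count"] += 1
        entry["total_time"] += seconds
    rows = [{"query": q, "db": db, **s} for (q, db), s in stats.items()]
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)[:limit]


# Hypothetical query log entries: (normalized query, database, seconds).
log = [
    ("SELECT * FROM orders WHERE id = ?", "shop", 0.02),
    ("SELECT * FROM orders WHERE id = ?", "shop", 0.03),
    ("SELECT * FROM orders WHERE id = ?", "shop", 0.02),
    ("UPDATE stock SET qty = ? WHERE sku = ?", "shop", 1.50),
    ("SELECT SLEEP(?)", "test", 6.00),
]

for row in top_queries(log, sort_by="total_time"):
    print(row)
```

Sorting by `count` instead surfaces the most common query, which may be
cheap per call but still dominate overall load.
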
You can also update the settings. You can also go to the advisors
section. You know advisors are one of the key features I talked about.
You see a lot of advisors. Green obviously means that everything is
okay. Then you see yellow, which is a warning status. We can see this
warning: it's "table access without using index". We can see the
specific instance, the specific database node, that this warning is
about. And we can see the justification, everything you need to
understand it: this table has been queried without using an index. You
see which database it is, with a total of 7000 I/O operations, and it
gives you the exact number of reads and the exact number of deletes.
You can see this for several other types of advisors, say "check DB
accounts without a password". Or this one, excessive CPU usage: it
shows you that there is excessive CPU usage and this is something you
need to look at. And there are other advisors.
And the good thing is you can also customize these advisors. If I go
here to Manage, I can go to the scripts and make changes, say, to the
CPU usage one. This is written in the ClusterControl domain-specific
language. It's similar to JavaScript, and developers with development
experience can easily understand it and make changes. I see this is a
variable with a threshold of 90. So I can change this to, let's say,
two, and then compile and run and see what happens. You know from
before that CPU usage was okay. But now, with the threshold lowered, I
get the warning that there is excessive CPU usage, that CPU usage has
been high.
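
The effect of changing that threshold can be sketched in a few lines.
The example below is Python, not the ClusterControl domain-specific
language, and it does not reproduce the actual advisor script; the
`Advice` type, the function name, and the 35% reading are hypothetical.
It only shows why the same CPU reading flips from okay to warning when
the threshold drops from 90 to 2.

```python
from dataclasses import dataclass


@dataclass
class Advice:
    status: str         # "ok" or "warning"
    advice: str         # what to do
    justification: str  # why


def cpu_advisor(cpu_usage_percent, warning_threshold=90):
    """Return an advice with a status and justification for a CPU reading."""
    if cpu_usage_percent > warning_threshold:
        return Advice(
            status="warning",
            advice="Investigate what is consuming CPU on this host.",
            justification=(
                f"CPU usage {cpu_usage_percent}% is above the "
                f"{warning_threshold}% threshold."
            ),
        )
    return Advice(
        status="ok",
        advice="No action needed.",
        justification=(
            f"CPU usage {cpu_usage_percent}% is within the "
            f"{warning_threshold}% threshold."
        ),
    )


reading = 35  # hypothetical CPU usage in percent
print(cpu_advisor(reading).status)                       # default threshold of 90
print(cpu_advisor(reading, warning_threshold=2).status)  # lowered threshold
```
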
Let me edit this back to 90. I can see now that there's a new status:
CPU usage is okay. So depending on your needs, depending on what the
threshold is for you, you can edit advisors, and edit the
configurations for your clusters and for your load balancers, like you
can see here, MaxScale, for your different databases and your specific
usage. You can edit things to fit your needs.

So now let's look at alarms. I can go to the alarms tab, and I see
that there's no alarm here. In order for us to get an alarm, let's
simulate a long-running query. I go to the nodes. From ClusterControl
I can SSH directly into the nodes, so I don't have to go to my server
provider or cloud provider. So I SSH into this primary node. Let me
log in, then log into MySQL and enter the password. Let me run this
test sleep query. So we have this query running. I go back to the
ClusterControl dashboard and wait for a few seconds, and I will see
that the alarm pops up.
After a few seconds, you see the alarm pops up. The title is "long
running query", the severity is warning, and the category is DB
performance. You can look more into the long-running query. I can see
the exact query that I ran, and it shows you a full overview of what
actually happened. The recommendation is to tune the query. This is
just a sleep; after 6 seconds it stops.
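
As a rough illustration of what a long-running query check does, the
Python sketch below filters a process list for statements that have
been executing longer than a threshold. The rows are modeled loosely on
the columns of MySQL's `SHOW PROCESSLIST` output; the threshold, the
sample rows, and the `SELECT SLEEP(6)` statement are assumptions, since
the exact query from the demo isn't given in the transcript.

```python
def long_running(processlist, threshold_seconds=5):
    """Return processes that are executing a query for too long.

    Idle connections (command "Sleep") are ignored even if they are old.
    """
    return [
        p for p in processlist
        if p["command"] == "Query" and p["time"] >= threshold_seconds
    ]


# Hypothetical snapshot of a process list.
processlist = [
    {"id": 11, "user": "app", "db": "shop", "command": "Sleep",
     "time": 120, "info": None},
    {"id": 12, "user": "app", "db": "shop", "command": "Query",
     "time": 0, "info": "SELECT * FROM orders WHERE id = 7"},
    {"id": 13, "user": "root", "db": "test", "command": "Query",
     "time": 6, "info": "SELECT SLEEP(6)"},
]

for p in long_running(processlist):
    print(f"long running query ({p['time']}s) on {p['db']}: {p['info']}")
```

A real check would poll the process list repeatedly and raise or clear
the alarm as queries start and finish.
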
I can choose to mute or ignore this alarm, and the notification goes
away. I muted it because I don't want to worry about it; this is a
demo environment.

Now let me show you how ClusterControl handles fault detection. I'll
go into the PostgreSQL cluster. There's no alarm. I'm going to the
nodes, and you see there's one primary and one replica. In
ClusterControl, for this cluster, I have auto recovery turned on:
cluster auto recovery and node auto recovery. I can choose to turn
them off, but I want them on for this. So let me stop this node, the
primary node. I stop the node, and a job gets created. I can go and
see that the new job, the stop node job, is created and running, and I
see that the auto recovery setting is true. So I go back to the nodes.
I shut down the node, and I get an alarm: cluster failure, critical,
cluster recovery is needed. And after a few minutes you see that my
replica node became a primary. So it happens fast. Let's go look at
the logs. We see "recovering the master as all hosts are down", so
ClusterControl automatically recovers the node and starts the node
again. And here you see my cluster has been healed from the cluster
failure. The cluster failure alarm is gone, and the cluster is back up
and running.
The initial replica node became the primary, and the primary became
the replica. Now we've covered most of the parts: we've covered fault
detection, query monitoring, and performance monitoring, and looked at
the dashboards that ClusterControl provides, ClusterControl's recovery
features, and how you can schedule maintenance.
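
The role swap seen in the recovery demo can be sketched as follows.
This is a deliberately simplified Python model of a failover decision,
assuming a single primary and a list of replicas; the node names and
the function are hypothetical, and real failover (in ClusterControl or
anywhere else) also has to deal with replication lag, fencing, and
rejoining the old primary.

```python
def fail_over(nodes):
    """Promote the first healthy replica if the primary is down.

    Returns a new list of node dicts; the input is left unchanged.
    """
    primary = next(n for n in nodes if n["role"] == "primary")
    if primary["up"]:
        return [dict(n) for n in nodes]  # nothing to do

    healthy = [n for n in nodes if n["role"] == "replica" and n["up"]]
    if not healthy:
        raise RuntimeError("no healthy replica available to promote")
    promoted = healthy[0]["name"]

    result = []
    for node in nodes:
        node = dict(node)
        if node["name"] == promoted:
            node["role"] = "primary"
        elif node["name"] == primary["name"]:
            node["role"] = "replica"  # old primary rejoins as a replica
        result.append(node)
    return result


# Hypothetical two-node PostgreSQL cluster with the primary stopped.
cluster = [
    {"name": "pg-1", "role": "primary", "up": False},
    {"name": "pg-2", "role": "replica", "up": True},
]

for node in fail_over(cluster):
    print(node["name"], node["role"])
```
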

So now let's look at the activity center. While doing all this, we
looked at the alarms, logs, and jobs, and those make up the activity
center. We can go to the activity center. I see there's currently no
alarm, because there's no issue right now. If we simulated one again,
like we saw, we would see another alarm pop up. But we can also view
all the jobs with respect to each cluster. Let me just zoom in a bit.
You can see the title of each job: the node stop, the node recovery,
backups that were previously created. And we can also look at audit
logs, which show who did what. A ClusterControl environment is not
managed by only one person, so you see which user did what. Right now
I've only used one user, so you only see the CC admin.
But if you need another user, you can always create users. I can
easily create a user with ClusterControl: I create a new user, create
a team, and assign the user to the team. I can also implement LDAP if
the organization already uses Active Directory, so you don't need new
credentials to access ClusterControl. So it just slots into your
current security regulations and compliance. You can configure the
LDAP settings and map groups to each team.

But back to the activity center. That's what ClusterControl gives you
to see what is happening in the cluster and also to perform actions.
Then there are operational reports. I've already created some
operational reports, so let me create another one to show you how
ClusterControl can help you see what happens with your cluster. I
select the Postgres cluster, then I choose the database availability
report. I can put in an email for a recipient, and I can set the range
to three days. Then I create this report, and within a few seconds the
report is created. Then I can view this report, and it opens in a new
page. So what's the availability of my Postgres cluster?
It's 99.98%. And the uptime: was there any downtime? There was no
downtime.
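
For a sense of scale, availability is conventionally uptime divided by
total time in the reporting window. The talk doesn't say how the report
computes its figure, so the Python sketch below is only the standard
arithmetic, assuming the three-day window selected for the report.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability as a percentage of the reporting window."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def downtime_for(total_minutes, availability):
    """Downtime in minutes implied by an availability percentage."""
    return total_minutes * (1 - availability / 100.0)


window = 3 * 24 * 60  # three days, in minutes

print(availability_percent(window, 0))          # no downtime at all
print(round(downtime_for(window, 99.98), 2))    # 0.86
```

Under this assumption, 99.98% over three days corresponds to a bit
under one minute of unavailability.
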
But you remember I simulated a node going down. Even so, you see there
was zero minutes of downtime because of ClusterControl's auto recovery
feature. You can also create different types of operational reports:
schema changes, database growth, capacity reports, and standard system
reports that just give you an overview of your system, like CPU usage
and so on. And let's look at backup reports. Let me create a backup
report for the Galera cluster. I create that, and I can view it.
You see the backup was completed, and you see a detailed overview of
all the backups: backup size, storage location. I store this backup on
the server, on the node. But you can also store your backups on AWS
and other clouds, and you'd see that location of the backups as well.
You can see when the backup was created. You can create verified
backups. And you can see the backup retention, in days. This gives you
an understanding of what's happening with the cluster, what happened
before and what's happening now. And that gives you the satisfaction
that your cluster is up now and well performing. And if anything
happens, you can easily react. There's automated reaction, and you can
obviously react manually if need be. You saw I was able to easily SSH
into a node directly from the ClusterControl environment. That is
pretty helpful.
In case of an emergency situation, I can perform actions right away.
So now you see how ClusterControl can help you simplify database
troubleshooting, see what's happening in your cluster in real time,
perform actions immediately, and have a well-performing and highly
available database.

So now we're going to look at who uses ClusterControl for observability.
ClusterControl is trusted by a lot of organizations, some you know,
some you don't. I'm just going to look at two examples, two case
studies of organizations that adopted ClusterControl mainly for
observability and monitoring. The first is Westpay. Westpay is a
Swedish fintech company, and they had a challenge while growing in the
market: they were not able to know what was going on in their cluster.
They spent a lot of time tweaking and manually making changes. And
manual customization can lead to key-person dependence. Let's say the
person that did those tweaks and changes moves on from the company, or
the person is sick at a particular time. Key-person dependence is
something that every organization tries to avoid, and should avoid.
This is the feedback of Westpay's CTO after they adopted
ClusterControl. They were able to ensure the highest possible uptime,
because they knew what was happening in their database: they knew when
long queries were occurring, they knew when the database was growing
exponentially, and they knew what to do.
They knew when a node was down, and there was auto recovery, and all
this. You can read more about the case study by scanning this QR code.

We also have another example with HolidayPirates. Managing and scaling
their databases was a challenge for them, and they needed a solution
that gives them that holistic view, that single pane of glass, a full
understanding of what's happening, and the ability to take action. Now
they are notified of the status of their databases, and they can take
corrective and preventive measures. Like you saw in the demo, in
ClusterControl some notifications and alarms are warnings, and you
have red, which obviously means failure, a very critical alarm. We
didn't simulate that in the demo, but yeah, very critical alarms. And
you have green for OK status. You can get these notifications on
Slack, or integrate these alarms and notifications with any incident
management tool you use. You can read more about this case study by
scanning this QR code.

That is the end of my talk. Thank you very much for listening, and I
hope this talk was interesting and insightful. If you're looking to
adopt ClusterControl, you can easily reach out to me. In the talk
description, you can see my socials on LinkedIn and also on Twitter.
Also, at Severalnines we host the Sovereign DBaaS podcast. Sovereign
DBaaS, like I mentioned earlier, is about enabling companies to take
control of their data. It's a community of individuals from different
companies that prioritize data sovereignty: not just data sovereignty
on the basis of where your data is located, but control over the
entire data stack, your infrastructure and the tooling around your
database. There are many episodes, including a recent episode on using
an open source cloud. You can check it out on Spotify and on YouTube
at Severalnines. And you can always reach out to me; I'm always active
on social media. I hope this was a great talk and was interesting for
you to listen to. Do have a great rest of Conf42 Observability.