Revolutionizing DB OPS with Rapydo

Abstract

In this talk, Matan Nataf, co-founder and CEO of Rapydo, addresses the complexities of managing relational databases in the cloud era. Focusing on the challenges brought by microservices and multi-tenant architectures, he underscores the limitations of current tools in achieving scalability and performance. Nataf highlights the key pillars of modern database management: observability, resiliency, and cost-efficiency. He introduces Rapydo's solutions, which offer consolidated visibility and automated query management to mitigate database issues. Finally, a live demo showcases Rapydo's capabilities in optimizing database performance and stability.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey folks, this is Matan. I'm the co-founder and CEO of Rapydo Technologies. Today I want to talk to you a little bit about relational databases in the cloud, the challenges that come with the new era of microservices and multi-tenant architectures, and why, in the cloud world, managing relational databases is essentially broken. I'm grateful to be a gold sponsor of Conf42. We're excited to be here and to be part of your community, at least for the upcoming year, to learn from your insights about the DevOps, SRE, and cloud engineering aspects of databases, and hopefully to bring value to you. Multiple folks from this community are using Rapydo's tools. So let's dive in.

Managing relational databases in the cloud is broken. Why do we say that? The cloud has introduced companies to new challenges related to managing auto-scalable applications. Kubernetes and auto scaling groups have introduced dramatic flexibility, but also severe challenges for any shared-everything architecture like a relational database. At the end of the day, your application is able to scale automatically by 1000x, but your database is one monolithic piece that is very, very hard to manage. So your databases and your applications in the cloud have evolved and changed dramatically, but the tools for managing the DBMS itself have not changed at all. This is why we believe that the entire way companies manage relational databases in the cloud today is broken, and it introduces a lot of challenges related to observability, SLA, resiliency, and cost.

I'm going to skip the previous slide, but the trend was that in the past, companies managed a small fleet of databases, right? On-prem, when you have a few monolithic applications, it is relatively easy to manage your database. You have a full-blown DBA team that knows exactly what it is doing on the databases.
The changes in scale are not severe: not thousands of percent, but 10, 12, 15%. So sizing for the peak is not hard, and it's not expensive. Due to these factors, managing performance and tackling issues was not as hard as it is today, right? Mean time to detect and mean time to resolve database issues were not high back in the days when you had a full-blown DBA team and monitoring systems watching 5 to 10 databases. There were, of course, disadvantages to living in a monolithic world in terms of time to market for new applications and flexibility in launching new services, but on the database side of things, living in that world was not too bad, besides the Oracle or SQL Server license that you had to pay.

In today's world, with microservices, micro applications (in many cases thousands of micro applications), and managing tenants, there is a great opportunity to accelerate time to market for new applications and new features, with substantial flexibility, right? Building and deploying new features became so fast that, in many cases, we deploy new features to our SaaS applications every day. We all rely on open source databases like MySQL and Postgres, and it actually became a great world for managing applications.

But alongside the improvements and the value that we gained in the auto-scalable, cloud-native world, a lot of challenges came to teams managing RDS, or relational databases in the cloud. When you manage a fleet of databases, it is very hard to reduce downtime and improve performance, right? Tracking 5000 databases and understanding exactly where your CPU is close to the edge is not easy. Gaining visibility across all databases in a single dashboard is something that many companies are seeking, but it is not easy with current tools like RDS Performance Insights or Datadog.
It's not easy to get that level of visibility across a fleet of databases. Finding and addressing slow queries across your entire fleet is hard. Due to that, we see mounting costs, and many companies are just pouring more and more money into databases in order to reduce SLA and resiliency issues.

Think about a native AWS tool like RDS Performance Insights, right? It gives you visibility into only one DB instance. And when you look at one DB instance while managing a fleet of 5000 databases, you are by definition reactive and not proactive. One of the things we always want when we manage such an important asset as our databases is to be able to be proactive. And even when you can find the root cause, when you have a fleet of databases, what can you do about it? Sometimes you have legacy code, or you have a big R&D organization. What can you do as an operational team when you see CPU or I/O spikes or a problematic query? At the end of the day, the only thing you can do is ship it to your dev team.

So what are the properties of the ideal tool? We believe the three pillars of a modern database management tool are observability, so improving observability across thousands of instances; resiliency, so improving resiliency with no code changes; and maintaining a sane cost for your databases. If you look at any AWS or Google Cloud bill, RDS and Cloud SQL will typically be the second or third line item. So observability, resiliency, and cost are the three legs of this tool when managing relational databases in the cloud, and missing any of them will not get you the results you need. Now let's dive a bit further into observability.
So what exactly are we missing in today's existing tools? First, it's no longer effective to look at each database individually, right? We need a solution that looks across all of our databases and pulls all the metric data into a single consolidated view. Essentially, you need to see: what are my slowest queries across my entire fleet? Which DBs have the highest CPU load right now across my entire fleet? We want to see the highest memory and I/O consumption across the entire fleet. We want to set alerts and get notified if certain thresholds are hit. We want to see whether we are running out of database connections, right? DB connections are limited, and with 500 or 5000 databases, tracking where I have a shortage of connections is very hard. So this is another thing we want to manage. We want to get visibility into table locks and deadlocks across 5000 databases. These capabilities are something you must have when managing databases for a SaaS application.

The second pillar is resiliency. The tool that you use must be able to put guardrails in place to protect the databases from the types of issues that cause outages, right? When your CPU spikes to 100%, you've got to be able to protect your database from these spikes. When bad queries are hitting your database, you've got to be able to respond fast and rewrite those queries in transit. When bad code is sending too many requests, the same query over and over to your database, you've got to be able to throttle that or catch that. When you have excess idle connections, you run out of connections and you run out of availability, right? So you've got to be able to protect your database from a shortage of connections. These capabilities must be an integral part of managing your database fleet in the cloud.

On top of that, you've got to be able to do it without pouring tons of money into your database, right? You've got to make sure that your CPU and I/O costs are not going through the roof. You've got to have a capacity plan that is not based on your CPU spikes. You've got to be able to protect your database from CPU spikes that will kill your application. So again, resiliency, observability, and maintaining a low or sane cost for the database are what you must have when managing a fleet of databases.

So this is where I introduce Rapydo to you. Rapydo is a platform that is fully deployed in your VPC. It allows companies with large fleets of databases to manage visibility and observability. As you can see, Rapydo Scout, our observability platform, connects to all your Postgres and MySQL databases and starts aggregating data about their processes and about infrastructure metrics like CPU, I/O, and memory, and consolidates them in one pane of glass. On top of that, we have Cortex, which is our automation layer. It allows companies to offload repetitive queries from their database via a proxy, and also via the connections that we manage. So you can deploy caching, throttling, query rewrite, a lot of capabilities in one tool, in one platform. So now let's move to a live demo of Rapydo and see the platform in action.

So guys, this is the Rapydo console. As you can see, our master dashboard gives you visibility into the most pressing issues across your entire fleet of databases. Let me move to 48 hours so we have more data. OK. As you can see, these are the most pressing issues across my entire database fleet in the last 48 hours, right?
We can see the average query runtime across my entire fleet, right? The average query runtime is 12 seconds. But when I drill down, I can see that I have some queries with an average runtime of 120 seconds, or two minutes, and some queries with sub-second or one-second runtime. Again, every line item that you see here is a query on a DB instance. On top of that, you will be able to see the top CPU consumers across your entire fleet. You will be able to see a shortage of connections. You will be able to see locks and deadlocks across your entire fleet.

You can drill down in every column that you see. I just drilled down to my locks view. And if I group by (we love the group by capability), you can easily see that my Postgres database is by far the entity suffering most from locks across my entire fleet. So using Rapydo, you can actually gain visibility into Postgres and MySQL instances in one pane of glass, right? My Postgres database is first; all the others are MySQL databases. Now I can extend that and see whether I'm facing a transaction lock or a metadata lock. I can aggregate by the current statement, and I can see that there are two statements, one UPDATE and one SELECT, on my Postgres database that are generating a lot of locks. At this point I can use Rapydo or engineering tools to fix that locking.

Right now, as you can see here, we had a CPU reduction on multiple databases, but we have one database here that is running at 99 to 100% CPU utilization most of the time. At this point, guys, I can move to the real-time view of Rapydo. As I mentioned, here you can actually see a flat list of all your Postgres and MySQL databases in one pane of glass. I can group by again and see that I have one Postgres instance and 11 MySQL instances.
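The group-by drill-down on locks described above can be illustrated with a small sketch. This is not Rapydo's implementation; the `LockEvent` record, its fields, and the instance names are hypothetical, and the sample data is made up purely to show the aggregation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LockEvent:
    # All fields are hypothetical; a real collector would fill them
    # from each engine's own lock/activity views.
    instance: str   # database instance name
    engine: str     # "postgres" or "mysql"
    lock_type: str  # e.g. "transaction" or "metadata"
    statement: str  # kind of statement involved in the lock


def group_locks(events, key):
    """Count lock events per value of the attribute `key`, most frequent first."""
    return Counter(getattr(e, key) for e in events).most_common()


# Made-up sample events from a mixed Postgres/MySQL fleet.
events = [
    LockEvent("pg-demo", "postgres", "transaction", "UPDATE"),
    LockEvent("pg-demo", "postgres", "transaction", "UPDATE"),
    LockEvent("pg-demo", "postgres", "transaction", "SELECT"),
    LockEvent("mysql-01", "mysql", "metadata", "ALTER"),
]

# Fleet-wide view: which instance suffers most from locks?
by_instance = group_locks(events, "instance")

# Drill down into the worst instance and aggregate by statement.
worst = by_instance[0][0]
by_statement = group_locks([e for e in events if e.instance == worst], "statement")

print(by_instance)
print(by_statement)
```

The same `group_locks` helper works for any attribute, so grouping by `engine` or `lock_type` gives the other views mentioned in the demo.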
And we can see my demo database here. It's a MySQL 8 database on a Graviton machine, extra large, I'm sorry, large, with 100 percent CPU and a shortage of database connections, right? So I have 100, I'm sorry, 250 connections out of 250 utilized, and I have a query running for five minutes on that database. Clicking on that allows me to drill down and see what is happening in real time on that particular database. You can see that we have a very repetitive query here, but the great part is that you can actually see how the issue started to compound. Looking 15 minutes ago and grouping by (again, we love the group by capability here in Rapydo) shows me that the SQL_CALC_FOUND_ROWS query from the parents table ran four times in parallel. Moving a bit forward in time, we can see that it spiked to 112, and a bit further, we are looking at 230. And if I go back to the live view, you can see that we currently have 237 queries running with that specific syntax.

Now, using Rapydo, I can actually do stuff, right? Not only gain visibility, but also pull the trigger and kill the query that is generating the spike. I know that killing a query is something you will probably think twice about before you pull the trigger. But having the ability to do that using Rapydo can be a lifesaver, because you can remediate the issue on the spot. Now, as you can see, I'm killing the query, but my application throws it right back at the database. So killing the query right there on the spot is not a very useful solution. At this point, guys, I want to move to the automation field of Rapydo. Let me quickly refresh my page here. I'm going to go ahead and log in again to my application. Let's give it a sec. Yeah, I guess I loaded my database too much.
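To make the "same query running many times in parallel" observation concrete, here is a minimal sketch of grouping running queries by a normalized fingerprint. It is an illustration only, not how Rapydo does it: the `fingerprint` helper and the sample query strings are assumptions, and in practice the list of running queries would come from something like MySQL's `SHOW PROCESSLIST` or Postgres's `pg_stat_activity`.

```python
import re
from collections import Counter


def fingerprint(sql: str) -> str:
    """Collapse literals and whitespace so repeats of one query compare equal."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)  # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)            # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


# Made-up snapshot of currently running statements on one instance.
running = [
    "SELECT SQL_CALC_FOUND_ROWS * FROM parents LIMIT 10",
    "select sql_calc_found_rows *  from parents limit 20",
    "SELECT 1",
]

counts = Counter(fingerprint(q) for q in running)
top = counts.most_common(1)[0]
print(top)
```

Taking such a snapshot at intervals and comparing the count for one fingerprint over time is one simple way to see an issue compounding, as in the 4, 112, 230 progression from the demo.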
There you are. Going down to the automation field shows me the vast number of rules that you can deploy using Rapydo. I can go ahead and cache a query. Using Rapydo, you can essentially deploy caching with table-based invalidation. I did not copy the query, so let me go back to Rapydo and copy the query again. By the way, as you can see here, it dropped, and I lost the connection to the database, which is why I don't have data here. But as you can see, it started compounding again, and now the query count is growing rapidly, right? At this point, I've copied the query. I'm going back to my caching rules, and I can go ahead and deploy a cache rule. A cache rule can be based on a query or a table: I can cache an entire table or a query. In this case, I would probably want to cache this query, so I'm going to choose a query. And look at this: we are currently the only solution that enables DevOps or SRE teams to do something they are unable to do without Rapydo, which is to deploy a caching rule on their own with no business consequences. Because if you use table-based invalidation, we will make sure that the cached data is always coherent with the database itself, right? So if I go ahead and save that, I now have a query cache that will essentially offload that query.

Now, that probably won't be the best solution for this use case, because this query is using the current date, so it essentially changes every nanosecond, right? So at this point, I'm moving to Scout rules. In Scout rules, I've already prepared a rule that throttles this query to two. It will allow this SQL_CALC_FOUND_ROWS query to run only two times in parallel. And look at this: I'm using a wildcard here on the LIMIT, right? So it's a flat line.
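The throttling rule described here, a wildcard pattern plus a cap of two parallel executions, can be sketched roughly as follows. This is a toy model under stated assumptions, not Rapydo's rule engine: the `ThrottleRule` class, its `admit` method, and the pattern syntax (shell-style wildcards via Python's `fnmatch`) are all hypothetical choices made for illustration.

```python
import fnmatch
import threading
from contextlib import contextmanager


class ThrottleRule:
    """Allow at most `max_parallel` concurrent executions of matching queries."""

    def __init__(self, pattern: str, max_parallel: int):
        # Shell-style wildcard pattern; an assumption for this sketch.
        self.pattern = pattern.lower()
        self._slots = threading.BoundedSemaphore(max_parallel)

    def matches(self, sql: str) -> bool:
        normalized = " ".join(sql.lower().split())
        return fnmatch.fnmatchcase(normalized, self.pattern)

    @contextmanager
    def admit(self, sql: str):
        """Yield True if the query may run now, False if it should be held back."""
        if not self.matches(sql):
            yield True  # rule does not apply, let the query through
            return
        acquired = self._slots.acquire(blocking=False)
        try:
            yield acquired
        finally:
            if acquired:
                self._slots.release()


# The wildcard after LIMIT makes the rule independent of the limit value.
rule = ThrottleRule("select sql_calc_found_rows*limit *", max_parallel=2)
query = "SELECT SQL_CALC_FOUND_ROWS * FROM parents LIMIT 10"

# Three overlapping attempts: the first two get a slot, the third does not.
with rule.admit(query) as first, rule.admit(query) as second, rule.admit(query) as third:
    overlapping = (first, second, third)

print(overlapping)
```

A proxy sitting between the application and the database could apply such a rule per incoming statement and reject or queue the ones that are not admitted, which is why the application can keep resending the query without the count on the database growing past the cap.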
Think about it as a flat line: using the wildcard here essentially limits that query to running only two times in parallel. Now I'm going back to my MySQL database. It remembers the configuration I had, and as you can see, the problem compounds again. But at this point, guys, I'm going to go ahead and pull the trigger, killing that process, and within a few seconds you will see that it's gone. And, excuse me, it's not going to spike again, right? So essentially, with a throttling rule in place, and you can already see it starting, I now have only two instances of the SQL_CALC_FOUND_ROWS query, right? My application keeps throwing the query at the database, but now Rapydo allows only two executions in parallel. You can see it right there, going down from 75 to 46 to 2, right?

So yes, this is what Rapydo is all about. I hope this session was interesting. I would love to dive further and show you Scout rules and alert rules. Using Rapydo, you can essentially maintain a low footprint of connections, and you can maintain policies that keep your database more secure, sane, and not spiky. Going back to the presentation: feel free to drop me an email at mat at rapido io. I would love to extend the conversation and discussion on relational databases in the cloud. Even if it's not related to Rapydo, I'm down for any discussion about databases, and I really hope to meet you guys face to face soon. So have a great day and take care. Bye-bye.

Conf42 DevOps 2025 - Online

January 23 2025 - premiere 5PM GMT

Matan Nataf

Co-Founder & CEO @ Rapydo
