How to monitor your business, and not just your pods

Abstract

Your dashboard glows green, but angry users flood support. Revenue drops sharply. Hours later, you uncover a critical bug. 0 alerts. Join me to revolutionize your monitoring. Learn to catch business-killing issues before they strike. Live demo included. Don’t just watch pods—safeguard your business.

Transcript

This transcript was autogenerated. To make changes, submit a PR.

Hey guys, thanks for joining in. Today, we're going to talk about a less common approach to production monitoring called business-driven monitoring. But before I begin, I want to ask you something. Do you know that exciting feeling of "yes, it works"? That feeling is why I love developing software. Some people get this thrill from looking at a test pass after multiple failures. But me, I get it from looking at Grafana. Seeing traffic from real users going through a brand new process I created and getting a new experience is something that gives me tremendous satisfaction. The way I see it, monitoring is the gift wrapper you put on before you ship the feature to production.
And I bet some of you are thinking, we don't need to worry about that. Monitoring comes out of the box. And in many cases, it's true. In large tech companies like Wix, there are great out-of-the-box production monitoring solutions, but they tend to focus on a single microservice, or even a single pod. Now, I hope that by the end of this talk, you will see monitoring a bit differently. Instead of monitoring latency and errors, you will switch your focus to monitoring business KPIs. I will guide you step by step in creating your first business-driven dashboard. That's how you'll become a true owner of the product, and not just a developer.
Hi, I'm Or. I've been creating Grafana dashboards for the past five years. In the past two years, I've been working as a backend developer at Wix.com. There, I'm part of the Wix e-commerce platform, a platform for building e-commerce sites and stores online. For those of you who don't know Wix, Wix is a site-building platform that powers over 10 million sites across the internet. One million of them are e-commerce sites. In terms of monitoring, Wix microservices expose over 150 million unique metrics per hour and half a trillion logs a day. So huge scale, huge global impact. And that's why I come to work every day.
When I say business driven, I'm not talking about devs doing analysts' work, not at all. Business analysts set KPIs and measure them monthly for business growth. What I'm talking about is end-to-end monitoring of product funnels, tracking business KPIs to identify crucial bugs that affect users before they notice them. When you look at a single service and examine its stats, like latency and errors, you have no idea if users get the wanted experience from your product as a whole. But if you change perspective and think a bit like a product manager, you can find metrics that will tell you the full story.
Alright, now that we know why it matters, let's get into the how. As developers, we have two main ways to gather live data: logs and metrics. Logs capture what happened, while metrics capture how well it happened. Logs are great for debugging and tracing specific issues, while metrics are ideal for monitoring system health and performance. They are both turned into graphs that compose meaningful dashboards. And there are so many tools out there for any of these purposes. But today we're going to focus on these three: Loki, Prometheus, and Grafana. I know this sounds a bit like characters from a Marvel movie, but from the very beginning of my career, these were the tools I used for production monitoring. They're all open source, so even when I worked at a low-budget startup, I ran them on premise with Docker Compose. So you don't have to be a big company to be monitored.
Now let's present them. Loki is your log aggregation system. It consumes the logs and stores them in its DB. Prometheus does the exact same thing for metrics. They are both integrated into Grafana, where we create visualizations and alerts.
Now that we've got our tools lined up, it's time to put them to work. Let's break down the implementation process step by step. Alright, so the first step is to define the business KPIs and product funnels.
Then, we'll light things up with logs and metrics. And finally, we'll create visualizations in Grafana.
Imagine walking into the office. You've got the keys, you're in charge. The system is already in production and you want to know how well you are doing. But there is zero monitoring. So how the hell will you know where you stand? You'll start out by mapping your business funnels, identifying crucial flows, and breaking them down into steps. At this phase, you should already know what you expect the system to do. You should be able to define your business KPIs. This is the part where you want your product managers and analysts on board, because they will enrich your tech perspective with more business-oriented goals. Each funnel should have at least one business KPI, or else why does it exist? For example, in the e-com platform, our main KPI is checkout conversion, but within this huge funnel leading to checkout conversion, we have more specific per-step KPIs.
So now you know what you expect the system to do. But more importantly, you should notice what you're not seeing: what data is missing, where your dark spots are. And that's where it gets interesting. Because once you've got the funnels, you can start to get intimate with every step. If you see a dark spot, you can light it up with metrics. Now don't stick to what could go wrong and the error flows, but think about your business KPIs. Think about what will make you happy with the product. Light up those paths as well. My advice here is to be generous. Add so many metrics that you'll feel like you know every corner of the flow.
For those of you who have never used metrics, once you've done the hard work of integrating Prometheus and Grafana, adding one is super easy. You only need one line of code and one line in Grafana, and you've got it. I repeat, one line of code and one line in Grafana. Mind blowing, right? Behind the scenes, there is a library that takes this log or metric and translates it to Loki and Prometheus clients.
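The "one line of code" idea can be sketched roughly as follows. This is a minimal illustration, not the internal Wix library (the talk does not show it): the `report` helper, the `metric_counts` store, and the event name are all hypothetical, and an in-memory counter stands in for a real Prometheus client so the sketch runs with the Python standard library alone.

```python
import logging
from collections import Counter

logger = logging.getLogger("business_metrics")

# In-memory stand-in for a metrics client (assumption: a real setup
# would forward these increments to a Prometheus client library).
metric_counts = Counter()


def report(event: str, **context) -> None:
    """Record one business event as both a metric and a log line."""
    metric_counts[event] += 1             # metric: how often it happened
    logger.info("%s %s", event, context)  # log: what exactly happened


def get_delivery_options(checkout_id: str) -> list:
    # The single line a developer adds to light up this funnel step:
    report("delivery_options_requested", checkout_id=checkout_id)
    return ["pickup", "local_delivery"]  # placeholder business logic
```

The point of the wrapper is that the calling code only states which business event happened; where the log and the metric end up is decided in one place.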
But even without a proper infrastructure in your company, there are super friendly client packages that will make you production ready in no time. Prometheus, for example, has client libraries in almost any programming language that exists.
So now that we've got a metric set up, let's visualize the data. That's how it looks in the most basic way. What you see here is a time series that shows the value of the metric per minute. But Grafana offers so many visualizations and unit types. Feel free to mix and match till you find the right ones for your specific step. Now, I know it looks a bit intimidating, all these graphs, but in a few minutes, we'll dive into the demo and you'll see how easy it is.
Okay, so up until now, we created a dashboard that measures KPIs in our business funnel. Before we jump into the demo, I want to share with you some other tips to light things up, and I will do it with a real example. In the Wix delivery system, we serve as a platform, so we call delivery carriers, like USPS for example, to get real-time delivery options and costs. Now, if you remember, earlier I said that the Wix e-commerce platform's main KPI is checkout conversion. But within checkout, there is a delivery step where buyers choose how they want the product to be shipped to their house. Having delivery options is crucial to completing checkout. So we must monitor that USPS returns good options to the buyers. So we need to light up this flow. We need to add metrics.
But when we did that, we got a very blurry picture. We saw a success, but we didn't know which delivery carrier succeeded. We saw a failure, and we didn't know which one failed and why. And that's where labels come in handy. Once we added the identifier of the carrier as a label, we got, in seconds, dozens of flows monitored, one per delivery carrier. What you see down below are internal identifiers of different delivery carriers, like USPS and pickup locations. And that's the power of labels.
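To make the effect of a label concrete, here is a toy sketch under stated assumptions: `LabeledCounter` is a hypothetical stand-in written for this illustration (not a real Prometheus client class), and the metric name and carrier values are made up. It shows how a single metric with a `carrier` label fans out into one series per carrier.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter: one metric name, one time series per label value."""

    def __init__(self, name: str, label: str):
        self.name = name
        self.label = label
        self.series = defaultdict(int)  # label value -> count

    def inc(self, label_value: str) -> None:
        self.series[label_value] += 1

    def total(self) -> int:
        # What you see without the label: one blurry number.
        return sum(self.series.values())


successful_calls = LabeledCounter("successful_carrier_calls", label="carrier")
for carrier in ["usps", "pickup", "usps", "local_delivery", "usps"]:
    successful_calls.inc(carrier)

# With the label, the same metric breaks down into one flow per carrier.
by_carrier = dict(successful_calls.series)
```

Here `successful_calls.total()` is the single undifferentiated count, while `by_carrier` holds the per-carrier breakdown that the label makes possible.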
They allow you to expose multiple metrics from a single one with different labels. And all I need to do is add "by" and the label name at the end of the Grafana query. Labels are super useful when it comes to an open platform, because every site on Wix can have different delivery carriers installed, and I need to be open and flexible. But at the same time, this is a crucial business flow, so it must be monitored, and labels allow me to do just that.
So I guess I convinced you that labels are great, but with great power comes great responsibility. Labels can make your service blow up. Every new value you pass to a label actually generates a new metric. And the more metrics your service exposes, the harder it works on just being monitored. So you need to make sure you control the values of the labels you're passing. Because if you accidentally pass a label with unbounded values, your service will keep exposing more and more metrics. And at a certain point in time, it will work too hard and it will just stop being monitored completely. And later on, it will stop working completely, which is crazy. So we want to keep our microservices safe. We want to control the metrics we're exposing. And as you can see above, Wix actually has a monitoring alert on the number of metrics we use for monitoring, which is monitoring cannibalism, if you ask me. But it's for this very reason: we want to safeguard our microservices, but at the same time monitor them properly.
Okay. It's time for a demo. We're going to dive into the system I've been working on for the past year, the Wix delivery system. Our goal is to allow stores to offer diverse delivery options for each of their products, so that every buyer will find the right one for their needs. And I'm going to walk you through its main business flow. The first step is where a buyer opens the checkout page and wants to buy a product that will be delivered to his house.
The next thing we want to know is how many options the store offers. Do they offer pickup, local delivery, maybe USPS shipments? We want to see how diverse the delivery carriers that operate in the store are. Because the more diverse they are, the higher the chance the buyer will find one that suits him. The next thing we do is actually call these delivery carriers through an API and get the options and the costs from them. So we want to see that they display good options at checkout and that they don't fail. And finally, we want to see them click. We want to see buyers completing the delivery step at checkout and continuing to payment. So what we're going to do is create four steps and four visualizations in Grafana.
But before we do it, there are some things we need to know. What we're actually going to do is query on top of the Prometheus DB. And Prometheus offers some functions and some special syntax. So we're going to use sum to sum the metric values across different pods in production. Because a microservice at Wix is deployed on multiple pods, and all of them expose the same metric. But for our use case, for business-driven monitoring, we don't actually care about a single pod. We care about the business flow. So we want to monitor them all together. The next thing we'll use is M1 rate, which is the one-minute average of the metrics we're getting. This average will make the graphs smoother and will enrich our perspective. Okay, the next thing we'll use is the Wix internal labels. We'll use two of them, service and method. Service is the microservice that exposes this metric, and method is the specific endpoint that exposes this metric.
Alright, I guess we're ready. Let's dive into the demo. Alright, so we're starting from a blank page, an empty dashboard. And we'll create our first visualization.
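The aggregation described above (a one-minute rate per pod, then a sum across pods) can be sketched with plain arithmetic. Everything specific here is an assumption for illustration: the pod names, the sample values, and the metric and label names in the query string. The talk's "M1 rate" may be a Wix-specific construct; the `rate(...[1m])` form shown is only the closest analogue in standard PromQL.

```python
# Per-pod counter samples taken 60 seconds apart: (value_then, value_now).
# Pod names and numbers are made up for illustration.
samples = {
    "pod-a": (1200, 1260),
    "pod-b": (800, 830),
    "pod-c": (400, 430),
}
WINDOW_SECONDS = 60


def per_second_rate(then: int, now: int, window: int = WINDOW_SECONDS) -> float:
    """Average per-second increase of a counter over the window."""
    return (now - then) / window


# One rate per pod: this is the noisy "many lines" view.
pod_rates = {pod: per_second_rate(*pair) for pod, pair in samples.items()}

# Summing across pods gives one line for the whole microservice.
service_rate = sum(pod_rates.values())

# Roughly equivalent standard PromQL (metric and label names hypothetical):
QUERY = (
    'sum(rate(requests_total{service="delivery-rates", '
    'method="GetDeliveryOptions"}[1m]))'
)
```

Summing after taking the per-pod rate, rather than before, matters in real Prometheus setups because each pod's counter can reset independently.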
We're going to use Prometheus as a data source, and we're going to do the first visualization, which is the top of the funnel, where a buyer creates a checkout and wants delivery. We'll measure it by the calls to the delivery system entry point: every time a buyer creates a checkout and wants something delivered, it triggers a call to this endpoint. So the traffic will signify the number of buyers that want delivery. Okay. So let's start. We'll set service, which is the microservice that we're working on, and this microservice is called delivery rates. And we're going to use the method, which is the entry point to the entire delivery system. It is called get delivery options. And now if I run just that, you can see I got a lot of pods that are sending the same metric, but I don't actually care about different pods in my production. I want to see the entire microservice as a whole. So I'll use the M1 rate we talked about and the sum, and now I'm seeing a clear graph. Side note on that: don't mind the numbers, because we're using test data. On the left-hand side, there is the number of calls, and it's a time series, so you see it over time, six hours.
So the next step is the delivery carrier diversity. We want to measure how many carriers are available for each request we get. So we're monitoring the same microservice, and now we'll use a custom metric I defined in my code. It is called active delivery carriers for request. In this metric, I actually calculate how many delivery carriers are active per request that I get from checkout. And again, I'll use the M1 rate and the sum. But if you think about it, I don't actually care about the trend here. I care about the value. How many active delivery carriers do I get per request? So I'm going to switch from a time series to a stat panel. And here I can decide how this value will be calculated, on the right-hand side.
There is a calculation option, and currently the default is the last value, but I don't want the last value. I want the average across this time frame. So we'll use mean instead.
The next step: we're going to check how many delivery carriers display options at checkout. And the way for us to measure it is to take the API calls we make per carrier and check how many of them were successful. So we're monitoring the same microservice, and we're using a custom metric again, called successful delivery rates SPI calls. So we'll do the same thing, M1 rate, and then sum. So again, sum is the sum across pods, and M1 rate is the one-minute average. But if you remember, when I talked about labels, I talked about exactly these kinds of graphs. Because now I get a blurry picture. I don't know which delivery carrier displayed at checkout. So I want to break it down by carrier. So I'll add "by" with the label, and carrier is the label that I wrote in my code. And now I get multiple graphs, one per delivery carrier. These IDs you see below are internal identifiers in our system for different carriers, for example USPS or pickup locations, etc. Cool.
Now, let's do the same thing for failed carriers, which I personally think is more interesting, because successes are nice, but if I see a failure, I can do something about it. Okay, so we'll change the title to "Delivery carrier failed to display options on checkout". And now all we need to do is change the metric, because I have the same label there. So we'll, oh sorry about that, something went wrong. Okay, again, failed, false, nice. And now I get the same graph, just for failures. And as you can see, there is a certain app that fails pretty frequently, and that's something I can act on. I can talk to this app's owners, I can debug the issues, which is super powerful because it actually affects the bottom line.
Alright, the last bit, the actual delivery step conversion. So for this one, I'm not going to use regular Prometheus metrics.
I'll use ones that are used by our business analysts. So what you're seeing here is a bit different syntax, but it's the same thing. I take the delivery method completion in checkout and divide it by the contact details, which is the previous step at checkout. So it's a good approximation of how many buyers actually go through filling in the details, then continue to delivery, find a good option, and then continue to payment. All right, we did it. It took us a few minutes and we actually visualized the entire business funnel of the delivery system. I hope you're excited, because I am. Now let's continue.
Now that we've got a kick-ass dashboard, there are some things that can make us more alert and attentive to it. The first one, you got it right: alerts. In a week or two, everyone will forget you built this dashboard. The most proven way to keep it relevant is to integrate it with your regular production on-call alerts. Here you can see I added an alert to one of the graphs we created in the demo. The alert will be sent to our on-call channel with this custom message I defined below. Now remember, our goal in this talk is to find crucial bugs that affect users quickly and solve them with minimal impact. And alerts help us do just that: identify the issue before users start complaining, troubleshoot quickly, and find the root cause.
So now that we have actionable alerts, the most common thing we do straight after we jump in is to investigate deeper. And a great thing you can do with Grafana is to provide an investigation dashboard to really drill down, because you start from a business issue and now you need to understand it and quickly resolve it. So if you have a dashboard with pre-made investigation queries, filters, and latest logs, it can make you super quick in finding the issue and actually solving it faster.
So just to quickly recap, ownership is a choice. For me personally, it made development much more interesting.
Now, if you are ready to take ownership, monitoring is the ultimate tool to help you get intimate with the product and lead it in the right direction. To monitor a business, you should start out by mapping your business funnels, then light them up with metrics and logs, so you'll know every corner of the flow and can quickly troubleshoot every issue. Then add alerts, so you'll know in time when an issue arises. And that's how you become a true owner of the product and not just a developer. Thank you so much for listening. Feel free to reach out, and enjoy the rest of your conference. Bye.
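The delivery-step conversion from the demo (completions of the delivery step divided by completions of the previous step) and the alert built on it can be sketched as below. The counts, the function names, and the 0.6 threshold are assumptions chosen for illustration; the talk does not state the real numbers or the alert rule used at Wix.

```python
def conversion(completed: int, started: int) -> float:
    """Share of buyers who finished a step out of those who reached it."""
    if started == 0:
        # No traffic in the window. Treated as 0.0 here for simplicity;
        # a real alert rule would likely handle "no data" separately.
        return 0.0
    return completed / started


def should_alert(rate: float, threshold: float) -> bool:
    """Fire when conversion drops below the agreed threshold."""
    return rate < threshold


# Made-up counts for one time window.
contact_details_completed = 200  # previous checkout step
delivery_method_completed = 150  # delivery step

rate = conversion(delivery_method_completed, contact_details_completed)
alert = should_alert(rate, threshold=0.6)  # threshold is an assumption
```

Alerting on the ratio rather than on raw counts keeps the alert meaningful when overall traffic rises or falls during the day.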

Conf42 Kube Native 2024 - Online

September 26 2024 - premiere 5PM GMT

Or Rozenzweig

Backend Developer @ Wix
