Conf42 DevOps 2025 - Online

- premiere 5PM GMT

15 NGINX Metrics to Monitor

Abstract

Let’s take a look at some of the basic NGINX metrics to monitor and what they indicate. We start with the application layer and move down through process, server, hosting provider, external services, and user activity. With these metrics, you get coverage for active and incipient problems with NGINX.

Transcript

This transcript was autogenerated. To make changes, submit a PR.
Hi, and welcome to this presentation on 15 essential metrics for NGINX. I'm Dave McAllister, the Senior Open Source Technologist here for NGINX, which is part of F5, and I've been involved in the open source scene since before it was called open source. I'm glad to be here at Conf42 DevOps to talk about a project that's incredibly important in today's internet world.

NGINX, of course, is one of the most popular web servers, and as you can see from the Netcraft survey, we're up there, running on about 39 percent of surveyed computers as of November. Netcraft runs this monthly; I don't have the latest numbers for December at my fingertips. And not only does NGINX serve HTTP, index files, and support HTTP/2, HTTP/3, SSL/TLS, SNI, FastCGI caching, and so on, it can also be a reverse proxy and a load balancer, and as a load balancer it also supports in-band health checks.

But many things can impact the performance of NGINX. You can see an increase in requests. You can see higher disk throughput or network throughput. You might be starting to run out of resources like CPU or memory. Or your web applications may actually have problems in them as well. Monitoring is essential. In our DevOps world, we need to be able to look and understand what's happening at all times with our applications, as well as how we are serving our client base.

Let's start by talking about collecting the data. NGINX has two characterizations of projects. One is the NGINX Open Source core project, and the metrics there are gathered through the stub_status module. Over on the right, you can see that you basically add this single line to your nginx.conf file, and you now have the ability to grab those metrics; there are eight metrics plus a health check in there. NGINX Plus, which is the commercial product, so I'm not as familiar with it, adds an API endpoint to query specific metrics, and it has a lot of metrics, somewhere around 130 or so, which get into far more detail and drill into more things. Both offerings also have the ability to scrape logs, so you have even more capability for pulling metrics and data out of those. And while your choice of visualization is left up to you, we use Grafana heavily internally; NGINX Plus does have some built-in dashboard capability.

And with that, let's step to the next part of this. You're going to hear me hammer this: the most essential metrics you need to track depend on your specific goals and needs. This is a set of metrics that will cover a lot of general cases, but you may need to go deeper into certain characteristics. Keep in mind that these metrics are not an absolute must-do; they fit a very common, general-purpose case of running NGINX. Your mileage may vary.

Aligning these things with the various capabilities: we have request metrics, which look at what's coming in to the server; response metrics, which look at how the server is feeding back; and so on, down into things like caching metrics, SSL metrics (because they do take performance hits), as well as security metrics. Security is incredibly important in today's internet. All of these things feed together to help us understand what's happening with each of these pieces.

So with that, let's take a look at the next set. Let's start with active connections.
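Before we dig in, here's a rough sketch of that single-line stub_status setup mentioned a moment ago; the location path, port, and access rules are placeholder choices of mine, not anything prescribed in the talk:

```nginx
# Expose the stub_status page on a local-only endpoint
# (the path /nginx_status, the port, and the access rules are placeholders)
server {
    listen 127.0.0.1:8080;

    location = /nginx_status {
        stub_status;        # exposes the basic connection and request counters discussed here
        allow 127.0.0.1;    # keep the metrics endpoint off the public internet
        deny  all;
    }
}
```

Curl that endpoint (or point a collector at it) and the first line of the response is the active connections count, which is exactly where we'll start.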
Obviously, this metric indicates the number of currently active connections on your server, but it helps you understand the load that is being placed on that server. It helps you start looking at things like: do you need to scale your resources? Are your configurations running optimally? Do you need to handle traffic more efficiently? It can lead into things like understanding load-balancing structures. How many active connections are coming in is a baseline for what you need to do to make NGINX perform efficiently. NGINX was built on the concept of performance and efficiency all the way back when it was undertaken to solve what was then known as the C10K problem.

With this, there are two variables. The first one, $connections_active, is in NGINX Open Source. If you're using NGINX Plus, you can use nginx_connections_active, pulled through the API endpoint from which you can grab metrics. All of the metrics I'm talking about, with the possible exception of the health check, are most useful in a strip-chart model, in other words a time-series display, aggregated together so that you can best understand what's happening to the underlying system at any given moment.

Request rate is the next piece inside of this. Request rate, obviously, is the number of requests being handled per second. So this starts getting into not just the connections, how many things are pinging you at a time; this is how many things are being asked of you at a time. And it helps you understand your traffic patterns. Most of the time, you're going to see this in a time-series display as a sawtooth: during the day things go up, during the night things go down. But not always. And because of this, you can start looking at how to balance both your resources and your load balancer against your traffic patterns. But sudden spikes in request rate can indicate certain things you need to be aware of, such as a traffic surge. Did marketing decide they're going to start offering a new special between the hours of 3 a.m. and 6 a.m.? Is there an attack going on against your system? Has somebody launched a DDoS attack against you? In both cases, the variable of choice would be $request_time, which is going to show you, basically, the number and how many are being handled on a per-second basis. It's an important metric to have, because it is the thing that is driving the engine.

So, response time. If you think back to the one before, request rate, response time says how long the server is taking before it sends things back out. Response time metrics cover not just the processing of your request, but also the case where it processes and sends something through to an upstream server. In other words, you're using a microservices approach: you've got another application, you're pinging something else for data, a database. This is going to indicate to you how long it's taking before it is ready to send a response back to the client. This lets you understand how well your applications are running, as well as start identifying bottlenecks, particularly on a microservices basis. So once again, it's not just the web server, it's also the upstream servers. And here you're going to want to combine request time, how long things are taking, with upstream response time: how long do things take when they're coming in, and how long do things take before they go back out? You can think of this sort of as an endpoint measurement from a microservices viewpoint. We now know how long things take between the activities that are happening. And again, in open source, use $request_time and $upstream_response_time to give you those two pieces of information.
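Since $request_time, $upstream_response_time, and the variables coming up next ($status, $bytes_sent, $upstream_cache_status) are all per-request variables, the usual way to collect them is by scraping the access log, as mentioned earlier. A minimal sketch, assuming a custom log format; the format name and field layout are my own, not from the talk:

```nginx
# Custom access log capturing the per-request metrics discussed here:
# request rate (count the lines), response times, status, bytes, cache status
# (the name "metrics" and the field order are arbitrary choices)
log_format metrics 'time=$time_iso8601 status=$status bytes=$bytes_sent '
                   'rt=$request_time urt=$upstream_response_time '
                   'cache=$upstream_cache_status "$request"';

access_log /var/log/nginx/metrics.log metrics;
```

Any log scraper can then turn those fields into the strip-chart, time-series views described above.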
As much as we'd love to be able to say everything is perfect, there are always errors. That's why we have so much focus on our DevOps monitoring environments: something is likely to go wrong. I used to do a talk on the Murphy's Laws for observability, and that's one of the things that's always very true. Something will always go wrong, probably at the worst possible time. So monitoring error rates, in this case, is mostly looking for 4xx and 5xx status codes, and it's crucial for identifying errors in your servers or your applications. Again, keep in mind that most of our web environments are no longer just sending static pages out; they're connecting to applications, they're connecting to databases, they're connecting to other things behind this. When you start seeing a high error rate, it can mean a lot of different things. But the place to start checking, very honestly, is a denial-of-service attack, to make sure that you aren't being attacked from the outside. Look at application bugs: is the application doing something wrong? Or do you simply have a misconfiguration, something that's causing a breakage between the incoming client and the server, or as we return that information? Here, $status is your variable of choice; it will give you that particular set of information.

We do still have to consider our underlying resources: CPU and memory usage. NGINX uses resources, and even though it is designed to be small, efficient, and highly performant, you may still need to optimize those resources, especially as loads grow. Looking at these can start telling you about your resource optimization and whether you need hardware upgrades, as well as helping with some of the planning aspects. Here, with open source, we recommend using system monitoring tools (top, htop, vmstat), and NGINX Plus provides specific endpoints that let you look at what process CPU and process memory are being used by NGINX. In any case, it's important to look at the underlying structure as you build these out.

SSL handshakes are equally important; in fact, in many ways, the SSL handshake is something we can't live without these days. Looking at the SSL handshake lets us see how well we are driving things as they go through. A long handshake time usually means there's a configuration problem, but it can also mean that your hardware is running slow, and that will be indicated by the previous metric, looking at CPU and memory. NGINX Open Source and Plus both provide variables to look at that handshake timeframe. That's incredibly crucial these days: we need these secure connections, and this will let you know that your secure connections are not slowing down your requests and responses.

And then, as we move through, throughput. To quote John Mashey, one of my favorite computer scientists: latency lives forever, throughput costs money. So if we overscale or underscale, it's costing us money one way or another. Throughput measures the total amount of data that is sent to clients, and it helps us understand our data flow. That data flow is crucial because we have to be able to handle the required bandwidth at peak.
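A quick aside on the CPU and memory point a couple of paragraphs back: with the open source core, the system tools mentioned (top, htop, vmstat) are usually all you need to watch the NGINX processes themselves. A rough sketch; the process-matching options are my own choice, not something from the talk:

```sh
# One-shot view of CPU and memory usage for the NGINX master and worker processes
ps -o pid,%cpu,%mem,cmd -C nginx

# Or watch them live in top, limited to the NGINX process IDs
top -p "$(pgrep -d',' -x nginx)"
```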
Understanding our peak loads in terms of connections, request rates, and throughput means we know how best to balance our resources, particularly, for instance, in an elastic environment, be it a Kubernetes environment or something else that can scale up and down based on load. Keep in mind that throughput is also one of your limiting factors, and your limiting factors are going to determine how far you can ultimately go. $bytes_sent, in NGINX Open Source or Plus, will give you this information so that you can best understand what's going on with throughput.

And then, finally (or maybe not quite finally), blocked requests. Blocked requests can help you identify potential security threats. You have to worry about things that are blocked. If you see a high number of blocked requests, you've possibly got an ongoing attack, somebody trying to penetrate your system, or you may just have a bunch of unauthorized access attempts. This does not always mean some outside malicious actor; it can be as simple as a configuration error or a security policy that's a little over-tuned. $status, again, will tell you where your blocked requests are going. With NGINX Plus, you can get a specific security blocked-request count. Those will help you understand whether you need to break off the attack or revisit your security policies.

Oh, and cache hit ratio. If you're using NGINX as a reverse proxy, you really do want it to hit the cache as much as possible: far more efficient, faster, more performant, better, all of the above. If you're getting a high cache hit rate, it means you're getting lots of things from the cache. That's good. It reduces the load upstream, and it means your response times are faster. And we know that users really only care about their request: whether it succeeded or failed, and how long it took. So the more you can cache, the happier your users are likely to be. $upstream_cache_status in NGINX Open Source will tell you that.

So there are these great metrics. So what? They're just a bunch of numbers. Let's take a look at some of the ways these metrics might actually apply and what they indicate.

Let's take a first scenario on monitoring active connections, and you've got a spike. You can see a baseline of about 500 active connections, with a normal distribution that varies by about 10 percent up and down, but all of a sudden you've got a spike that hits 1,500. So your spike is substantial, and it's probably a serious outlier condition. Is it a traffic surge? Did marketing start doing a special? For those of you who go back far enough to remember Slashdot, maybe you got slashdotted? All of those things suddenly mean you're going to see a lot of activity. And because you're now getting more active connections, it's likely that your request rate is rising, your response times are getting slower, and possibly even your error rate is climbing; those are all caused by an active connection spike. But it could also indicate a DDoS attack. Cross-check your 404s and 403s to see what's going on with those error cases. You might also suddenly be running into resource limits.
Maybe something has happened and your load balancer is trying to do something weird, redirecting things to a single system rather than multiple systems, so your spread is not quite the same. So you may need to look at the resources that underlie this structure.

Scenario 2, also in active connections: a gradual increase. Same baseline, same normal operation, but now your numbers are increasing. (Oh, and sorry, there's a typo on the slide: I left out a zero, so 100 should be 1,000.) The request rate goes up, and equivalently the response times go up and the error rate goes up. Maybe this is a good thing. Maybe you're growing the number of people using your system. That's a good thing, and it means you need to look at planning and scaling your resources. Here you can use the gradual increase, looking at throughput, requests, and response times, to start planning what you need to adapt and what resources to add. Maybe your systems are getting saturated. Maybe you're running out of memory on a subsystem and it can't handle the load, and therefore it's slowing down, which is going to slow down everything else. So you need to look at whether your underlying system structure is capable of meeting the demands. And if you're in an ephemeral environment with elastic behavior, are you scaling correctly? Are you at a point where an application microservice isn't scaling correctly, or maybe your load balancer isn't recognizing new services as they come online?

And then, also in connections: all of a sudden things are just bouncing all over the place. You don't quite understand it, but you look at this and your connections go to 400, then they double, then they drop down, then they triple, and then they go back to the mean; you're all over the board. Each of these things, by the way, does tend to impact the others: more connections usually means a higher request rate, which usually means more response time, which can mean an increase in errors. But you've got intermittent traffic. Where is the intermittent traffic coming from? Again, is it something periodic? You will see these slopes. All of a sudden, somebody has decided they need to run a report, and it's hitting your system and eating all of the resources talking to your data repositories. Or you've got a bunch of batch work that's suddenly being kicked off for whatever reason. Or, again, you've been slashdotted or you've got a marketing push. Something is causing the activity here, and it doesn't necessarily mean an outside force. It could be a load-balancing issue: your load balancer's configuration needs to be looked at, so you can make sure it is handling the traffic appropriately and passing it to the resources as necessary. And your applications can bottleneck, and remember, those bottlenecks can be outside of your application. Likewise, in a Kubernetes environment for instance, your application may actually be perfect and scaling correctly, but placed on a node, in a pod, with a noisy-neighbor problem, where the neighbor is eating all the memory even though your application is perfectly correct. Your application will show up as the problem, not the one that's eating all the memory. Look at your underlying structures and underlying resources.
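One small aside: the 404/403 cross-check mentioned in the first scenario can be as simple as counting status codes straight out of the access log. A rough sketch, assuming the default combined log format (where the status code lands in the ninth field); the log path is a placeholder:

```sh
# Count responses by status code; a spike in 403s or 404s stands out immediately
# (assumes the default "combined" format; request lines containing extra spaces
#  will skew the field count)
awk '{ print $9 }' /var/log/nginx/access.log | sort | uniq -c | sort -rn
```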
So let's now take a look at the other one I talked about: blocked requests. The reason I want to talk about blocked requests is that, quite often, they tie into the security aspects. Same active-connections picture, but you can see down here that your blocked requests are running about 5 per minute. That's actually not too out of the ordinary, maybe even a little low at times. You're running 4 to 6 on a regular basis for operations, and all of a sudden your blocked requests scale by a factor of 10. There is something going on, and it's usually one of three things. It's a security threat: somebody is attacking your system, so go back and look at those status codes to see what's going on and whether you're getting a lot of failed authorizations. Or you may have changed something in your firewall, or changed a WAF configuration; check for recent changes, because those recent changes may be blocking things as false positives. And those false positives can come not only from a configuration change; maybe you've set up a whole new series of security rules that are looking more deeply at the incoming connections. If you have security rules, you've also got configurations for those. So a blocked-request spike is usually some level of security threat, but it can also be generated by internal problems or internal concerns.

Same thing for a gradual increase. You might have become a target for attackers; you may have just suddenly shown up on somebody's radar, and they are beginning to scale up their attacks against your security. Again, look at the other features and functionality: are your requests going up here? Is your status distribution staying the same? Your security rules can play a role here, but it can also be a bot approach, somebody botting you to death, or, again, your resources, which could suddenly be slowing things down or not accepting things the way they normally would. Remember that blocks still take resources, so as your blocks start going up, all of a sudden your performance may decline because you're no longer capable of meeting the needs of these requests. There are lots of different pieces that can happen here, but it's very important to understand this gradual increase.

And then, high variability. Similar structure: bouncing all over the place. Blocked requests go 5 to 30 to 10 to 40 to 5. This is an intermittent attack, more than likely. Somebody is just probing. It could be a malicious actor, it could be an automated script, it could be bots coming at you, but nonetheless, something from the outside is probably doing this. It's also often seen as a load balancer or proxy issue: because of the blocking, you end up with an inconsistent load. If you're blocking, you're not passing things through, and therefore your load balancer is not quite sure how to balance, because it's got a lot of things coming in but not all of them are being passed through. In this case, you might want to take a look at what your load balancer configuration looks like. You can also look for application misconfigurations, or maybe you have some vulnerabilities that somebody is taking advantage of just every now and then, hoping to slide in underneath you.

While I've talked about metrics, I also want to bring up OpenTelemetry. The OpenTelemetry module for NGINX is an open source project that brings the OTel tracing environment to NGINX. It fully supports W3C trace context and the OTLP/gRPC format.
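As a rough idea of what wiring the module up looks like, here is a sketch; the directive names follow the nginx-otel module's documentation, while the collector endpoint, service name, and upstream are placeholders of mine:

```nginx
# Load the native OpenTelemetry module (module path depends on your install)
load_module modules/ngx_otel_module.so;

http {
    otel_exporter {
        endpoint otel-collector.example.internal:4317;   # your OTLP/gRPC collector
    }
    otel_service_name nginx-frontend;                    # placeholder service name

    server {
        listen 80;

        location / {
            otel_trace         on;        # or a variable, for dynamic on/off control
            otel_trace_context inject;    # inject W3C trace context toward the upstream
            proxy_pass http://backend;    # placeholder upstream
        }
    }
}
```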
And we did this because NGINX's number one concern is performance and efficiency. We looked at the current modules that were out there and found that most of them had a 50 percent hit on request performance, and that was unacceptable. So we wrote our own native module; its performance hit is about 10 percent. Tracing always hits your performance, sorry guys, it does, but 10 percent is not too bad. Now, the nice thing is that this is set up inside your NGINX application configuration, so it doesn't require any separate structures or work. It allows you to dynamically control traces: you can use cookies, you can use tokens, you can use variables. That means you can turn tracing on and off and scale it up and down as you need, so without stopping your environment, you can actually control your tracing throughput.

The short version: the client request comes in, the request goes into the application, which feeds back to NGINX, and NGINX sends the response back. NGINX, via OTLP, can send the trace data off to an OTel collector, which can then feed your web app, go into your storage, and be displayed by almost everything these days, because everybody is using OpenTelemetry. Likewise, OpenTelemetry now supports metrics and logs, and you will see those also becoming available as part of our expansion into more monitoring capability for the NGINX open source core and NGINX Plus.

So here's a quick walk through the configuration, much as in the sketch above: you load the module, you set up the exporter endpoint, which is basically how you talk to the OTel collector, you tell it to listen, and then you say you want to see the traces, with trace-context injection and the proxied paths. Each of these pieces can be tuned; it has the ability to do sampling structures of all sorts, and because you can enable and disable it via a variable, you can have it configured and only turn it on when you need it.

The other side is Prometheus, and Prometheus is pretty much the de facto metrics structure. The NGINX Prometheus Exporter is an open source exporter that takes NGINX metrics and puts them in a form that Prometheus can scrape. It reads either the open source side, through the stub_status module, or the Plus API. You can find it on GitHub. It supports the open source core and the Plus product, but it also supports the two Kubernetes projects, NGINX Ingress Controller and NGINX Gateway Fabric. And here you can see the types of things it exposes and how it's set up, and this allows me to bring metrics into a common vernacular so that they can be displayed through your choice of tools.
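For the open source side, getting that going is roughly a one-liner; this sketch assumes the stub_status endpoint configured earlier, and the flag and metric names follow the nginx-prometheus-exporter project's documentation (the URI is a placeholder):

```sh
# Point the exporter at the stub_status endpoint configured earlier
# (match the URI to your own config)
nginx-prometheus-exporter --nginx.scrape-uri=http://127.0.0.1:8080/nginx_status

# The exporter then serves Prometheus-format metrics such as nginx_up,
# nginx_connections_active, and nginx_http_requests_total on :9113/metrics
# (the default listen port, per the project's docs) for Prometheus to scrape.
```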
Concluding, we've looked at some key metrics for NGINX. Remember, they are basic and generic in structure; it's a starting point, and your mileage may vary. The things that are important to you depend on what your desired outcome is. We looked at some scenarios, a couple of them around active connections as well as blocked requests. And we did underline some of the hardware concerns and software concerns. In a microservices environment, it's not enough to know that your web server is working well; you also need to know that the upstream servers are working well in conjunction with your web server.

And it's not just the metrics. My favorite quote, and I'm going to still use it here: the most effective debugging tool is still careful thought, coupled with judiciously placed print statements. Today, those print statements are our metrics, our traces, and our logs. They give us the ability to use our own insight and our own understanding of our environments to make sure that what we deliver is what our clients and our users are looking for.

And with that, my thanks for joining in today. You can find docs on performance for open source as well as Plus here. My slides are always on GitHub, and check me out on LinkedIn. I also mentioned two open source projects that you may be interested in: the NGINX Prometheus Exporter and the OpenTelemetry module for NGINX, both of which are available through our GitHub. And with that, thanks again, and enjoy the rest of Conf42 DevOps.

Dave McAllister

Senior OSS Technologist @ NGINX



