Five quick and useful VM metrics with Prometheus node exporter
When running applications or VMs, insight into what they're doing is extremely important. In this post I'll show you how to use the Prometheus node exporter to create a handful of useful metrics in Grafana.
A quick heads up before we get started: this post assumes you have a Prometheus and Grafana setup already working. If you don’t, you really should! It’s super easy to set up, and once you have it running you can use it to monitor all kinds of applications and hardware.
Setup
Rather than downloading a tarball (as the official guide mentions), on Ubuntu you can simply install it via apt:
sudo apt install prometheus-node-exporter
This also automatically sets up a service, so you don’t have to configure that yourself. You can check if the service is running with:
sudo systemctl status prometheus-node-exporter
Or by making a curl call:
curl http://localhost:9100/metrics
This should return a huge list of metrics about the machine.
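For illustration, a few of the entries look something like this (the metric names are real ones exposed by the node exporter, but the values here are made up):
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="iowait"} 42.5
node_cpu_seconds_total{cpu="0",mode="system"} 980.12
Believe it or not, we're almost done! However, before we expose this to the internet, there's one more thing we should take care of.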
Authentication
If we were to simply expose port 9100 to the internet, anyone and everyone could see our metrics. While there might not be any passwords or tokens in there, it does contain sensitive information about your infrastructure that you typically don’t want the whole world to know.
To set the node exporter up to require authentication, we need to create a config file and pass it to the service. Let’s start by creating an empty config file:
sudo mkdir /etc/prometheus-node-exporter
sudo touch /etc/prometheus-node-exporter/web-config.yml
Next, we need to create a password hash. The easiest way to do this is with the htpasswd command (on Ubuntu it's part of the apache2-utils package):
htpasswd -nBC 12 alice
Let’s break the command down:
- -n prints the result to stdout, instead of updating a file.
- -B uses bcrypt for hashing the password.
- -C 12 sets the bcrypt cost to 12. This defaults to 5, which is nowadays considered to be quite weak.
- alice is the username you want to use.
You will be prompted to enter a password (twice, for confirmation), and then the command will output a user/hash combination, like:
alice:$2y$12$WI7J4RdlVS3UONW.nQN/e.k6wQJtKKfb0xBPBmk205UPoaLcO9XMu
With this, we can go back to the config file and store the credentials:
basic_auth_users:
  alice: $2y$12$WI7J4RdlVS3UONW.nQN/e.k6wQJtKKfb0xBPBmk205UPoaLcO9XMu
/etc/prometheus-node-exporter/web-config.yml
Note that you can enter multiple sets of credentials here, should you want to. More information about the config file can be found here.
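For example, adding a hypothetical second user bob would look like this, with the placeholder standing in for the output of another htpasswd run:
basic_auth_users:
  alice: $2y$12$WI7J4RdlVS3UONW.nQN/e.k6wQJtKKfb0xBPBmk205UPoaLcO9XMu
  bob: <bob's bcrypt hash>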
With the configuration done, we need to instruct the node exporter service to actually use it. By looking at the service definition, we can see how to do that:
[Service]
...
EnvironmentFile = /etc/default/prometheus-node-exporter
ExecStart = /usr/bin/prometheus-node-exporter $ARGS
...
/usr/lib/systemd/system/prometheus-node-exporter.service
The command reads an $ARGS environment variable, which is loaded from the EnvironmentFile. So if we edit that file, we can pass arguments to the process. To make it use our config file, we can use the --web.config.file option:
ARGS="--web.config.file=/etc/prometheus-node-exporter/web-config.yml"
/etc/default/prometheus-node-exporter
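As an alternative to editing the packaged defaults file, a systemd drop-in override works too; a sketch (sudo systemctl edit prometheus-node-exporter opens an editor for the override):
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus-node-exporter $ARGS --web.config.file=/etc/prometheus-node-exporter/web-config.yml
The empty ExecStart= line clears the packaged command before redefining it, which systemd requires for services that aren't Type=oneshot.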
Now restart the service:
sudo systemctl restart prometheus-node-exporter
And check if it works! An unauthenticated curl call should now simply return "Unauthorized", but an authenticated call (using -u) should let us see the metrics:
curl -u 'alice:password' http://localhost:9100/metrics
If this doesn't work, or if something else went wrong, you can check the logs using journalctl (optionally adding the -f flag to follow):
journalctl -u prometheus-node-exporter.service
With that done, let’s turn these walls of metrics into graphs!
Five interesting metrics
The Prometheus scrape configuration is very straightforward, using the basic_auth section to specify our credentials:
scrape_configs:
  - job_name: Node exporter
    metrics_path: /metrics
    basic_auth:
      username: 'alice'
      password: 'password'
    scheme: http
    static_configs:
      - targets:
          - example.com:9100
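After saving this, it's a good idea to validate the configuration before restarting Prometheus; a quick check, assuming the default Ubuntu config location:
promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus
As a side note: if you'd rather not store the password in prometheus.yml itself, the basic_auth section also accepts a password_file option pointing to a file containing just the password.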
CPU usage
CPU usage is probably one of the most interesting metrics. This is exposed by the node_cpu_seconds_total metric, which indicates how long the CPU has spent in various modes (idle, iowait, system, etc.).
To get the per-core CPU usage as a value between 0 and 1, we can use this query:
1 - irate(node_cpu_seconds_total{mode="idle"}[1m])
Or, if you want to aggregate the data from all cores:
1 - avg(irate(node_cpu_seconds_total{mode="idle"}[1m]))
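One caveat: the aggregated query above averages over every core of every machine Prometheus scrapes. If you monitor more than one node, group by instance to get one series per machine:
1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))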
Memory usage
When you run free -h you'll typically see something like this:
total used free shared buff/cache available
Mem: 3.8Gi 892Mi 608Mi 4.3Mi 2.6Gi 2.9Gi
The “used” value doesn’t directly have a counterpart in the exported metrics, but the total and available values do. So we simply subtract one from the other to get the used memory:
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
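If you'd rather plot this as a fraction between 0 and 1 (which pairs nicely with a percent unit in Grafana), divide by the total:
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes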
Disk usage
While maybe not directly as interesting as the previous metrics, running out of disk space can still have a big impact on your machine and anything running on it.
The available bytes per file system are directly exported in the metrics. You can exclude the boot filesystems to remove some clutter, since those are not used outside the boot process anyway.
node_filesystem_avail_bytes{mountpoint!~"/boot|/boot/efi"}
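As with memory, a fraction is often more useful than raw bytes, since a small disk and a large disk have very different ideas of what "low" means. Dividing by the filesystem size gives the fraction of space still available:
node_filesystem_avail_bytes{mountpoint!~"/boot|/boot/efi"} / node_filesystem_size_bytes{mountpoint!~"/boot|/boot/efi"}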
APT package cache age
The APT package cache age is not as directly impactful as the previous metrics, but it is still useful for seeing how up to date your system is. This metric is exposed by the apt_package_cache_timestamp_seconds value, which we have to convert to milliseconds for Grafana:
apt_package_cache_timestamp_seconds * 1000
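Alternatively, you can compute the age in seconds directly in PromQL with the time() function, which returns the current evaluation timestamp:
time() - apt_package_cache_timestamp_seconds
With this variant, set the panel unit to a duration (such as "seconds (s)") instead of a date.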
Change the visualization type to “Stat”, and under “Standard options” change the unit to “Date & time / From Now”. This will show the timestamp as a time relative to now, e.g. “one hour ago”.
Reboot required
Finally, we have the simplest metric of them all:
node_reboot_required
This value is either 1 or 0, depending on whether the system needs to be rebooted. This can happen, for example, after a kernel update, when the new kernel still needs to be loaded.
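This metric also lends itself well to an alerting rule instead of just a dashboard panel; a minimal sketch, where the group name and the one-hour grace period are arbitrary choices:
groups:
  - name: node-alerts
    rules:
      - alert: RebootRequired
        expr: node_reboot_required > 0
        for: 1h
        annotations:
          summary: "Instance {{ $labels.instance }} requires a reboot"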
To make it look a little nicer than just a numeric value, change the visualization type to "Stat", and configure value mappings and thresholds.
Conclusion
With these basic metrics set up you've gained solid insight into a VM or machine, but we've only scratched the surface! The node exporter exposes many more metrics than the ones we used here, so I'd highly encourage you to explore them. Depending on what the node is expected to do, you can also get much more specific about what you want to measure, but everything described here can hopefully serve as a solid foundation.