
Five quick and useful VM metrics with Prometheus node exporter

When running applications, or entire VMs, insight into what they are doing is extremely important. In this post I'll show you how to use the Prometheus node exporter to create a handful of useful metrics in Grafana.

A quick heads up before we get started: this post assumes you have a Prometheus and Grafana setup already working. If you don’t, you really should! It’s super easy to set up, and once you have it running you can use it to monitor all kinds of applications and hardware.

Setup

Rather than downloading a tarball (as the official guide suggests), on Ubuntu you can simply install the node exporter via apt:

sudo apt install prometheus-node-exporter

This also automatically sets up a service, so you don’t have to configure that yourself. You can check if the service is running with:

sudo systemctl status prometheus-node-exporter

Or by making a curl call:

curl http://localhost:9100/metrics

This should return a huge list of metrics about the machine.
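
For reference, the output consists of plain-text lines like the ones below. These are real node exporter metric names, but the exact values and the full set of metrics will of course depend on your machine:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 81432.21
node_cpu_seconds_total{cpu="0",mode="system"} 512.45

Believe it or not, we're almost done! However, before we expose this to the internet, there's one more thing we should take care of.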

Authentication

If we were to simply expose port 9100 to the internet, anyone and everyone could see our metrics. While there might not be any passwords or tokens in there, the output does contain sensitive information about your infrastructure that you typically don't want the whole world to know.

To set the node exporter up to require authentication, we need to create a config file and pass it to the service. Let’s start by creating an empty config file:

sudo mkdir /etc/prometheus-node-exporter
sudo touch /etc/prometheus-node-exporter/web-config.yml

Next, we need to create a password hash. The easiest way to do this is with the htpasswd command (on Ubuntu it is part of the apache2-utils package):

htpasswd -nBC 12 alice

Let’s break the command down:

  • -n prints the result to standard output instead of updating a file.
  • -B uses bcrypt for hashing the password.
  • -C 12 sets the bcrypt difficulty to 12. This defaults to 5, which is nowadays considered to be quite weak.
  • alice is the username you want to use.

You will be prompted to enter a password (twice, for confirmation), and then the command will output a user/hash combination, like:

alice:$2y$12$WI7J4RdlVS3UONW.nQN/e.k6wQJtKKfb0xBPBmk205UPoaLcO9XMu

With this, we can go back to the config file and store the credentials there:

basic_auth_users:
  alice: $2y$12$WI7J4RdlVS3UONW.nQN/e.k6wQJtKKfb0xBPBmk205UPoaLcO9XMu

/etc/prometheus-node-exporter/web-config.yml

Note that you can enter multiple sets of credentials here, should you want to. More information can be found in the node exporter's web configuration documentation.
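
If you have promtool available (it ships alongside the Prometheus server; the path below assumes the config file location we used earlier), you can sanity-check the file before using it:

promtool check web-config /etc/prometheus-node-exporter/web-config.yml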

With the configuration done, we need to instruct the node exporter service to actually use it. By looking at the service definition, we can see how to do that:

[Service]
...
EnvironmentFile=/etc/default/prometheus-node-exporter
ExecStart=/usr/bin/prometheus-node-exporter $ARGS
...

/usr/lib/systemd/system/prometheus-node-exporter.service

The command reads an $ARGS environment variable, which is loaded from the EnvironmentFile. So if we edit that, we can pass arguments to the process. To make it use our config file, we can use the --web.config.file option:

ARGS="--web.config.file=/etc/prometheus-node-exporter/web-config.yml"

/etc/default/prometheus-node-exporter

Now restart the service:

sudo systemctl restart prometheus-node-exporter

And check if it works! An unauthorized curl call should now simply return “Unauthorized”, but an authorized curl call (using -u) should let us see the metrics:

curl -u 'alice:password' http://localhost:9100/metrics

If this doesn’t work, or if something else went wrong, you can check the logs using journalctl (optionally adding the -f flag to follow):

journalctl -u prometheus-node-exporter.service
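
For example, to follow the logs live while retrying the curl call from another terminal:

journalctl -u prometheus-node-exporter.service -f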

With that done, let’s turn these walls of metrics into graphs!

Five interesting metrics

The Prometheus scrape configuration is very straightforward, using the basic_auth section to specify our credentials:

scrape_configs:
  - job_name: Node exporter
    metrics_path: /metrics
    basic_auth:
      username: 'alice'
      password: 'password'
    scheme: http
    static_configs:
      - targets:
          - example.com:9100
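
Before reloading Prometheus, it's worth validating the edited configuration. Assuming the default Ubuntu location of /etc/prometheus/prometheus.yml, that looks like:

promtool check config /etc/prometheus/prometheus.yml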

CPU usage

CPU usage is probably one of the most interesting metrics. It is exposed through the node_cpu_seconds_total metric, a counter tracking how much time each CPU has spent in various modes (idle, iowait, system, etc).

To get the usage of each CPU as a value between 0 and 1, we can take the per-second rate of the idle counter and subtract it from 1:

1 - irate(node_cpu_seconds_total{mode="idle"}[1m])

Or, if you want to aggregate the data from all cores:

1 - avg(irate(node_cpu_seconds_total{mode="idle"}[1m]))
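
If you scrape several machines with the same job, you can instead keep one value per machine by averaging per instance, a straightforward extension of the query above:

1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))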

A graph showing CPU usage.

Memory usage

When you run free -h you’ll typically see something like this:

               total        used        free      shared  buff/cache   available
Mem:           3.8Gi       892Mi       608Mi       4.3Mi       2.6Gi       2.9Gi

The “used” value doesn’t directly have a counterpart in the exported metrics, but the total and available values do. So we simply subtract one from the other to get the used memory:

node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
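
If you'd rather have a fraction between 0 and 1, matching the CPU panel, you can divide by the total:

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes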

A graph showing memory usage.

Disk usage

While perhaps not as immediately interesting as the previous metrics, running out of disk space can still have a big impact on your machine and anything running on it.

The available bytes per file system are directly exported in the metrics. You can exclude the boot filesystems to remove some clutter, since those are not used outside the boot process anyway.

node_filesystem_avail_bytes{mountpoint!~"/boot|/boot/efi"}
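
Similarly, if you prefer relative usage over absolute free bytes, you can combine this with the exported filesystem size to get a used fraction:

1 - node_filesystem_avail_bytes{mountpoint!~"/boot|/boot/efi"} / node_filesystem_size_bytes{mountpoint!~"/boot|/boot/efi"}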

A graph showing file system usage.

APT package cache age

The APT package cache age is not as directly impactful as the previous metrics, but it still shows how up to date your system is. It is exposed as the apt_package_cache_timestamp_seconds value, a Unix timestamp in seconds. Grafana's date unit expects milliseconds, so we multiply by 1000:

apt_package_cache_timestamp_seconds * 1000

Change the visualization type to “Stat”, and under “Standard options” change the unit to “Date & time / From Now”. This will show the timestamp as a time relative to now, e.g. “one hour ago”.
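
Alternatively, you can compute the age directly in PromQL and display it with a duration unit (such as "seconds (s)" under "Time") instead:

time() - apt_package_cache_timestamp_seconds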

A panel with the text "22 minutes ago".

Reboot required

Finally, we have the simplest metric of them all:

node_reboot_required

This value is either 1 or 0, depending on whether the system needs to be rebooted. This happens, for example, when a new kernel has been installed and is waiting to be loaded.
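
If you use this setup to monitor several machines, the same metric can also be summed into a single fleet-wide panel showing how many of them currently need a reboot:

sum(node_reboot_required)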

To make it look a little nicer than just the numeric value, change the visualization type to “Stat”, and configure value mappings and thresholds:

Value mappings and thresholds for the reboot status.

A panel with the text "Reboot required".

Conclusion

With these basic metrics set up you've gained a solid amount of insight into a VM or machine, but we've only scratched the surface! The node exporter has a lot of metrics, most of which we didn't even use here, so I'd highly encourage you to explore those. Depending on what the node is expected to do you can also get much more specific about what you want to measure, but everything described here can hopefully serve as a solid foundation.