This blog post is a follow-up to my previous posts about setting up a cluster. If you haven't read them yet, I strongly suggest reading them first.

In this series of blogposts, I will explain how I configured my homeservers as a Nomad cluster with Consul as a DNS resolver for the cluster nodes and services.

This cluster is monitored using Prometheus and Grafana. This allows me to see in detail which nodes are operational, how high the workload is, etc.

Enabling telemetry in Nomad

Nomad doesn't expose Prometheus telemetry data by default. You can enable it by editing the configuration file of each Nomad agent you want to monitor.

Add the following stanza to your configuration file:

telemetry {
    collection_interval = "5s"
    disable_hostname = true
    prometheus_metrics = true
    publish_allocation_metrics = true
    publish_node_metrics = true
}

And restart the Nomad agent:

sudo systemctl restart nomad

The metrics are now available at https://<IP>:4646/v1/metrics?format=prometheus. You can test the endpoint with curl, which should return the measured metrics in the Prometheus text format (without the format parameter, Nomad returns a JSON object instead).
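A quick sanity check of the endpoint could look like the sketch below. The helper names nomad_metrics_url and count_samples are made up for this example, and the address 10.0.0.5 is only a placeholder for one of your agents.

```shell
# Build the Prometheus-format metrics URL for a Nomad agent.
# $1 is the agent's IP or hostname.
nomad_metrics_url() {
    printf 'https://%s:4646/v1/metrics?format=prometheus\n' "$1"
}

# Count the sample lines in Prometheus text output read from stdin,
# skipping "# HELP" / "# TYPE" comment lines and blank lines.
count_samples() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Usage (assumption: self-signed certificates on a trusted network,
# hence -k to skip certificate verification):
#   curl -sk "$(nomad_metrics_url 10.0.0.5)" | count_samples
```

If the count is zero or curl reports an error, the telemetry stanza has probably not been picked up by the agent yet.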

Setting up Prometheus

Download Prometheus for ARM from the Prometheus download page: https://prometheus.io/download/

Create the following configuration file:

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['leader:9090']
        labels:
          group: 'production'

  - job_name: 'nomad'
    scrape_interval: 5s
    metrics_path: '/v1/metrics'
    tls_config: # Nomad serves HTTPS with the TLS certs configured previously; verification is skipped here
      insecure_skip_verify: true
    scheme: https
    params:
      format: ['prometheus'] # Specify ?format=prometheus
    static_configs:
      - targets: ['<IP 1>:4646', '<IP 2>:4646'] # Specify nodes here, you can also use Consul services
        labels:
          group: 'production'
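If you have more than a couple of nodes, typing the targets list by hand gets tedious. Here is a small sketch that prints the list for the 'nomad' job from a set of addresses. The function name nomad_targets and the file name prometheus.yml are my own choices for this example; promtool is the checker that ships in the same archive as Prometheus.

```shell
# Print a YAML flow list of Nomad scrape targets on port 4646.
# Each argument is the IP or hostname of one Nomad agent.
# The body runs in a subshell so the helper variables stay local.
nomad_targets() (
    out=""
    for host in "$@"; do
        out="${out:+$out, }'${host}:4646'"
    done
    printf '[%s]\n' "$out"
)

# Usage (the addresses are placeholders):
#   nomad_targets 10.0.0.1 10.0.0.2
# Paste the result after "targets:" and validate the whole file with:
#   promtool check config prometheus.yml
```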

Add a systemd service file (/etc/systemd/system/prometheus.service) to run Prometheus at boot:

[Unit]
Description=Prometheus Time Series Collection and Processing Server
Documentation=https://prometheus.io/docs/prometheus
Wants=network-online.target
After=network-online.target
# Start rate limiting belongs in [Unit] on systemd 230 and later
StartLimitBurst=3
StartLimitIntervalSec=10

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=<INSTALL PATH>/prometheus \
    --config.file=<CONFIG PATH> \
    --storage.tsdb.path=<STORAGE PATH> \
    --web.console.templates=<INSTALL PATH>/consoles \
    --web.console.libraries=<INSTALL PATH>/console_libraries
KillMode=process
KillSignal=SIGINT
LimitNOFILE=infinity
LimitNPROC=infinity
Restart=on-failure
RestartSec=2
TasksMax=infinity

[Install]
WantedBy=multi-user.target

And enable it:

sudo systemctl enable --now prometheus.service

Prometheus should now be reachable in your browser at http://<IP>:9090.
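If you prefer to check from the command line, Prometheus has a readiness endpoint at /-/ready. The sketch below polls it a few times. The function names prometheus_url and wait_for_prometheus are hypothetical, and I assume Prometheus listens on plain HTTP on port 9090, as in the configuration above.

```shell
# Build a URL on the Prometheus server; the path defaults to "/".
# $1 is the server's IP or hostname, $2 an optional path.
prometheus_url() {
    printf 'http://%s:9090%s\n' "$1" "${2:-/}"
}

# Poll the readiness endpoint up to N times (default 10), pausing one
# second after each failed attempt. Returns 0 once Prometheus answers,
# 1 if it never does.
wait_for_prometheus() {
    ready_url=$(prometheus_url "$1" /-/ready)
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        if curl -fsS -o /dev/null "$ready_url" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage (the address is a placeholder):
#   wait_for_prometheus 10.0.0.5 && echo "Prometheus is ready"
```

Once it is ready, the Status > Targets page in the Prometheus web UI shows whether the Nomad agents are being scraped successfully.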

Setting up Grafana

Install Grafana from its APT repository:

# Add the stable repository
sudo apt-get install -y apt-transport-https
sudo apt-get install -y software-properties-common wget
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

# Install
sudo apt update
sudo apt install grafana

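A systemd service file (grafana-server.service) is installed together with Grafana; you can edit it if you want to use a configuration path other than the default. The package does not start the service on installation, so start and enable it with sudo systemctl enable --now grafana-server. You can find a complete installation guide in the Grafana docs.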

If you browse to <IP>:3000, you will see the Grafana login screen. After the first login, Grafana asks you to change the admin password. The default login is:

  • Username: admin
  • Password: admin

Grafana login screen

Go to Configuration (the gear icon) > Data sources > Add data source. Select the Prometheus data source and enter the URL of your Prometheus server, for example http://<IP>:9090.

Grafana configuring the Prometheus data source
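As an alternative to clicking through the UI, Grafana can also load data sources from provisioning files at startup. The sketch below prints such a file for a Prometheus server. The function name render_prometheus_datasource and the file name prometheus.yaml are my own choices, and the directory shown is the default provisioning path of the Debian package, so adjust it if your setup differs.

```shell
# Print a Grafana data source provisioning file for Prometheus.
# $1 is the base URL of the Prometheus server.
render_prometheus_datasource() {
    cat <<EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $1
    isDefault: true
EOF
}

# Usage (the address is a placeholder):
#   render_prometheus_datasource http://10.0.0.5:9090 \
#       | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```

I configured mine through the UI, but a provisioning file is handy if you ever need to rebuild the Grafana host.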

Now you can play around and add dashboards, panels, etc. You can find more information here: https://grafana.com/docs/grafana/latest/getting-started/getting-started/

My dashboard looks like this:

Grafana dashboard

Configuring Grafana email alerts

Now that you have your dashboard running, you can also add Grafana alerts to graphs and send alerts to your inbox when something goes wrong.

To enable email alerts, we have to configure an SMTP server for Grafana. This can be done by adding the following to the Grafana configuration file (/etc/grafana/grafana.ini by default):

[smtp]
enabled = true
host = <SMTP SERVER IP>:<SMTP SERVER PORT>
user = <USERNAME>
password = <PASSWORD>
from_address = <EMAIL TO USE>
from_name = <NAME SENDER>
ehlo_identity = <HOSTNAME TO SEND IN EHLO>
startTLS_policy = <OpportunisticStartTLS | MandatoryStartTLS | NoStartTLS>

Examples and an explanation of each configuration parameter can be found in the Grafana documentation: https://grafana.com/docs/grafana/latest/administration/configuration/#smtp
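Since a typo in the [smtp] section fails silently until the first alert, it can help to read the values back before restarting Grafana. This is a minimal sketch of an INI reader in awk. The function name ini_get is hypothetical and it only handles simple "key = value" lines, not every corner of Grafana's configuration syntax.

```shell
# Print the value of KEY inside [SECTION] of an INI file.
# Usage: ini_get FILE SECTION KEY
ini_get() {
    awk -v section="$2" -v key="$3" '
        /^[ \t]*[;#]/ { next }                    # skip comment lines
        /^[ \t]*\[/ {                             # section header
            gsub(/^[ \t]*\[|\][ \t]*$/, "")
            current = $0
            next
        }
        current == section {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)        # trim the value
            if (k == key) { print v; exit }
        }
    ' "$1"
}

# Usage:
#   ini_get /etc/grafana/grafana.ini smtp enabled
#   ini_get /etc/grafana/grafana.ini smtp host
```

After changing the file, restart Grafana with sudo systemctl restart grafana-server so the new settings are loaded.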

Now go to Alerting > Notification channels and create a new notification channel with Email as its type, then add the email addresses that should receive your alerts. You can also send a test alert from this page to make sure the configuration works.

If you now open a panel that uses the Graph visualization, you can click on the Alert tab and add a new alert. At the time of writing, Grafana only supports alerts on Graph visualizations, with a single alert per panel.

Once you have configured the trigger rule and the notification message, click Apply and wait a bit. By default, Grafana sends a notification when the rule has been firing for more than 5 minutes. You can change this by editing the 'For' parameter of the alert rule.

Grafana alert configuration and email alert