How I monitor my infrastructure

Mar 3, 2026

My monitoring setup

My monitoring stack is relatively simple. For visualization, I use Grafana. Metrics are aggregated by Prometheus, logs by Loki, and alerting is managed by Alertmanager.

I chose these tools because of their massive communities. In particular, the Prometheus ecosystem has an enormous number of plug-and-play exporters and integrations, making it highly modular and extensible.
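For instance, getting host metrics from a new machine usually amounts to dropping in node_exporter; a minimal Docker Compose sketch (the mounts and `--path.rootfs` flag follow the exporter's documented host-metrics setup, service name is illustrative):

```yaml
  node-exporter:
    image: quay.io/prometheus/node-exporter:latest
    restart: unless-stopped
    network_mode: "host"   # expose host network interfaces to the exporter
    pid: "host"            # expose host processes
    volumes:
      - /:/host:ro,rslave  # read-only view of the host filesystem
    command: --path.rootfs=/host
```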


Hardware

Because monitoring is a critical service for receiving alerts, I want it to run in the out-of-band (OOB) segment of my network. This ensures I still get notified even if my main network goes down.

For that, I use a Raspberry Pi to collect and store logs. Kubernetes metrics and logs from the various exporters are all forwarded to the board for centralization. Visualization alone runs on Kubernetes, to avoid overloading the Pi while keeping dashboards easily and rapidly accessible.


Raspberry Pi Configuration

The Raspberry Pi (nicknamed “Azurite”) runs RaspbianOS. I would have preferred NixOS, but it wasn’t stable enough when I tried it on the Pi. That said, I haven’t given up on it; maybe I’ll revisit it in the future.

Prometheus

Prometheus runs on the Raspberry Pi and scrapes metrics from all my machines and services. It also evaluates alerting rules and fires alerts via Alertmanager.
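Pointing Prometheus at a new exporter is just another scrape job; a hedged sketch of what one entry in my prometheus.yml could look like (hostname and port are placeholders):

```yaml
scrape_configs:
  - job_name: "node"
    scrape_interval: 30s
    static_configs:
      - targets: ["some-host:9100"]  # hypothetical node_exporter target
```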

Alloy

Alloy acts as a gateway between clients and both Prometheus and Loki.

On Kubernetes

On Kubernetes, Alloy collects logs from all pods using Kubernetes service discovery, then forwards them to Loki.

I deployed Alloy on Kubernetes via Helm, using the official Grafana chart:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana-alloy
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      chart: alloy
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
      version: "1.2.0"
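The `sourceRef` above assumes a HelmRepository object already exists in flux-system; a minimal sketch of what it could look like (the URL is the official Grafana chart repository, the interval is illustrative):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: grafana
  namespace: flux-system
spec:
  interval: 1h
  url: https://grafana.github.io/helm-charts
```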

The rest of Alloy’s configuration lives in a ConfigMap. The first step is setting up pod discovery and relabeling; this lets me filter logs by namespace, app, container, or pod name in Loki:

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
    action        = "replace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    target_label  = "app"
    action        = "replace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
    action        = "replace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
    action        = "replace"
  }
}

Then all discovered logs are sent to Loki:

loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.process.process.receiver]
}

loki.process "process" {
  forward_to = [loki.write.loki.receiver]

  stage.drop {
    older_than          = "1h"
    drop_counter_reason = "too old"
  }

  stage.match {
    selector = "{instance=~\".*\"}"

    stage.json {
      expressions = {
        level = "\"level\"",
      }
    }

    stage.labels {
      values = {
        level = "level",
      }
    }
  }

  stage.label_drop {
    values = ["service_name"]
  }
}

loki.write "loki" {
  endpoint {
    url = "https://XX.XX.XX.XX:3100/loki/api/v1/push"
  }
}
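With those labels in place, logs can be sliced directly in Grafana's Explore view; an illustrative LogQL query (the namespace and app values are hypothetical):

```logql
{namespace="monitoring", app="alloy"} | level="error"
```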

Troubleshooting: Too Many Open Files

I ran into an issue where Alloy processes were opening too many files simultaneously, causing log collection errors.

The fix required two changes. First, raising the ulimit via an init container:

initContainers:
  - name: raise-ulimit
    image: busybox
    command: ["sh", "-c", "ulimit -n 65536"]

Second, adding sysctl rules to lift the inotify limits in the NixOS nodes’ configuration:

  boot.kernel.sysctl = {
    "fs.inotify.max_user_watches" = "2099999999";
    "fs.inotify.max_user_instances" = "2099999999";
    "fs.inotify.max_queued_events" = "2099999999";
  };

These two changes resolved all log-scraping errors on Kubernetes.


On Virtual Machines and Physical Hosts

VMs and physical machines also run Alloy, this time to ship journald logs to Loki:


// Relabel journal logs to extract the systemd unit 
loki.relabel "journal" {
  forward_to = []

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }
}

// Read logs from journald

loki.source.journal "read" {
  forward_to    = [loki.write.endpoint.receiver]
  relabel_rules = loki.relabel.journal.rules
  labels = {
    component = "loki.source.journal",
    host      = "nixos-builder",
  }
}

// Forward logs to Loki
loki.write "endpoint" {
  endpoint {
    url = "https://XX.XX.XX.XX:3100/loki/api/v1/push"
  }
}

This configuration is baked into my NixOS template, so every NixOS machine I deploy gets it by default.
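In the template, wiring this up is mostly a matter of enabling Alloy and pointing it at the config file; a rough Nix sketch (option names are from nixpkgs's `services.alloy` module as I recall them, the path is an assumption):

```nix
{
  # Run Grafana Alloy as a systemd service on every machine built
  # from the template; the config file above is deployed to this path.
  services.alloy = {
    enable = true;
    configPath = "/etc/alloy/config.alloy";
  };
}
```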

On the Raspberry Pi - Syslog Ingestion

The Pi’s Alloy instance also receives Syslog messages and forwards them to Loki. This is useful for network devices and other equipment that doesn’t support Alloy natively. Alloy acts as a Syslog-to-JSON converter for Loki:


otelcol.receiver.syslog "default" {
  protocol = "rfc5424"
  tcp {
    listen_address = "localhost:1515"
  }
  output {
    logs = [otelcol.exporter.syslog.default.input]
  }
}

otelcol.exporter.syslog "default" {
  endpoint = "localhost"
  network  = "tcp"
  port     = 1514
  protocol = "rfc5424"
  enable_octet_counting = false
}

loki.source.syslog "default" {
  listener {
    address       = "localhost:1514"
    protocol      = "tcp"
    syslog_format = "rfc5424"
    label_structured_data = true
    use_rfc5424_message   = true
  }
  forward_to = [loki.write.default.receiver]
}

Loki

Loki is designed to run as a cluster, but since I only have a single node, I configured it in single-node mode. All Alloy clients send logs directly to it.

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  volume_enabled: true

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h


ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

Once Loki is configured, logs are visible in Grafana:

Grafana Logs

Blackbox Exporter

Blackbox is a prober that can check targets over HTTP, HTTPS, TCP, ICMP, and gRPC. I use it to monitor the availability of my services and get alerts when they go down.

Docker Deployment

  blackbox:
    image: quay.io/prometheus/blackbox-exporter:latest
    command: --config.file=/config/blackbox.yml --config.enable-auto-reload
    ports:
      - "9115:9115"
    volumes:
      - ./blackbox:/config

Blackbox Configuration

Blackbox has its own config file that defines how it probes targets:


modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      headers:
        Host: vhost.example.com
        Origin: example.com
      http_headers:
        Accept-Language:
          values:
            - "en-US"
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_body_matches_regexp:
        - "Could not connect to database"
      fail_if_body_not_matches_regexp:
        - "Download the latest version here"
      fail_if_header_matches: # Verifies that no cookies are set
        - header: Set-Cookie
          allow_missing: true
          regexp: '.*'
      fail_if_header_not_matches:
        - header: Access-Control-Allow-Origin
          regexp: '(\*|example\.com)'
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false  # no fallback to "ip6"

Prometheus Job

A Prometheus job is needed to scrape Blackbox results:

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://nixos.org
        - pdf.ridercorp.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115  # Rewrite the scrape address to the actual Blackbox exporter
  - job_name: 'blackbox_exporter' # Monitor Blackbox itself
    static_configs:
      - targets: ['blackbox:9115']
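These probe metrics pair naturally with an alert rule; a hedged sketch using Blackbox's `probe_success` metric (the duration and severity are illustrative choices, not from my actual rules):

```yaml
groups:
- name: blackbox
  rules:
  - alert: EndpointDown
    expr: probe_success == 0
    for: 2m   # tolerate a single failed probe before paging
    labels:
      severity: critical
    annotations:
      summary: "Endpoint {{ $labels.instance }} is unreachable"
```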

Load a Grafana dashboard and you can see the status of all your targets at a glance:

Blackbox Dashboard


NUT - Network UPS Tools

What is NUT?

NUT (Network UPS Tools) is an open-source project for monitoring and controlling UPS devices over USB, SNMP, and many other protocols. It can also act as a UPS server, allowing connected machines to gracefully shut down during a power outage.

How I Use It

I use NUT primarily with Prometheus to receive an alert whenever my UPS switches to battery power. This is essentially my only reliable indicator of a power outage at home.

NUT on the Raspberry Pi

My board connects to the UPS devices via USB, so NUT uses the usbhid-ups driver. Install NUT with:

sudo apt update && sudo apt install nut nut-client nut-server

Then configure /etc/nut/nut.conf to expose the daemon on the network:

MODE=netserver
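The UPS itself is declared in /etc/nut/ups.conf; a minimal sketch for a USB unit (the section name matches the `ups` label used below, and `port = auto` lets usbhid-ups find the device; the description is illustrative):

```
[elipse1600]
  driver = usbhid-ups
  port = auto
  desc = "living-room UPS"
```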

Since everything runs in Docker, I just add the NUT exporter container:

nut-exporter:
  image: ghcr.io/druggeri/nut_exporter:3.2.2
  restart: unless-stopped
  network_mode: "host"
  ports:
    - "9199:9199"

And add the Prometheus scrape job:

- job_name: "ups-1"
  metrics_path: /ups_metrics
  params:
    ups: ["elipse1600"]
  static_configs:
    - targets: ["XX.XX.XX.XX:9199"]
      labels:
        ups: "elipse1600"

- job_name: "ups-2"
  metrics_path: /ups_metrics
  params:
    ups: ["eaton3s"]
  static_configs:
    - targets: ["XX.XX.XX.XX:9199"]
      labels:
        ups: "eaton3s"

Load a nice dashboard and the result speaks for itself:

NUT dashboard

Alertmanager

Alertmanager handles routing and delivering alerts. It can fan out to multiple platforms simultaneously.

Future improvement: I’d like to set up high-availability Alertmanager in my OOB network using the gossip protocol. If you have recommendations on how to do this, feel free to reach out!

Docker Configuration

  alertmanager:
    image: prom/alertmanager:v0.31.0
    restart: unless-stopped
    command: --config.file=/data/alertmanager.yml
    volumes:
      - ./alertmanager:/data
    ports:
      - "9093:9093"

Alertmanager Config

Prometheus is responsible for evaluating rules and pushing fired alerts to Alertmanager:


global:
  resolve_timeout: 5m

route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'discord-receiver'


receivers:
- name: 'discord-receiver'
  discord_configs:
    - webhook_url: "https://discord.com/api/webhooks/XXX/XXXXX/"

I’m currently using Discord for notifications, though I’m looking for a more purpose-built alerting platform with proper webhook support.
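If I later split notifications by destination, Alertmanager's routing tree makes that straightforward; a hedged sketch of severity-based routing (the second receiver name is hypothetical):

```yaml
route:
  receiver: 'discord-receiver'      # default for everything
  routes:
    - matchers:
        - severity = "critical"     # critical alerts go to a dedicated channel
      receiver: 'pager-receiver'
      repeat_interval: 30m          # re-notify more aggressively
```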

Connecting Prometheus to Alertmanager

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

Alert Rules

I organize rules into one file per topic:


rule_files:
 - /etc/prometheus/ups-rules.yml

My first alert fires immediately when a UPS goes offline (i.e., switches to battery):

groups:
- name: ups
  rules:
  - alert: UPSNotOnline
    expr:  network_ups_tools_ups_status{flag="OL"} == 0
    for: 0s
    labels:
      severity: critical
    annotations:
      summary: "UPS {{ $labels.ups }} is not online"

The for: 0s ensures the alert fires instantly — no delay tolerated for power events.
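The same exporter makes it easy to add complementary rules; for instance, a battery-charge warning (the metric name comes from nut_exporter's `network_ups_tools_` prefix convention, the 50% threshold is an illustrative choice):

```yaml
  - alert: UPSBatteryLow
    expr: network_ups_tools_battery_charge < 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "UPS {{ $labels.ups }} battery below 50%"
```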

And here’s what it looks like in Discord:

alerte-discord

Conclusion

There’s still a lot for me to learn about monitoring. This is a first pass at covering the critical parts of my infrastructure. The main areas I want to improve:

  • Better Kubernetes observability — more optimal integration of the metrics Kubernetes exposes, and clearer targeting of what actually matters.
  • Automated alert creation — a more systematic approach rather than writing rules by hand.
  • Prometheus + Netbox integration — linking my monitoring stack to my network inventory could open up some interesting automation possibilities.

These are the directions I see for evolving this setup in the future.