My monitoring setup
My monitoring stack is relatively simple. For visualization, I use Grafana. Metrics are aggregated by Prometheus, logs by Loki, and alerting is managed by Alertmanager.
I chose these tools for their massive communities. In particular, the Prometheus ecosystem has an enormous number of plug-and-play exporters and integrations, making it highly modular and extensible.
Hardware
Because monitoring is a critical service for receiving alerts, I want it to run in the out-of-band (OOB) segment of my network. This ensures I still get notified even if my main network goes down.
For that, I use a Raspberry Pi to collect and store logs. Kubernetes metrics and logs from the various exporters are all forwarded to the board for centralization. Only visualization runs on Kubernetes, to avoid overloading the Pi while keeping dashboards easily and rapidly accessible.
Raspberry Pi Configuration
The Raspberry Pi (nicknamed “Azurite”) runs Raspbian OS. I would have preferred NixOS, but it wasn’t stable enough when I tried it on the Pi. That said, I haven’t given up on it; maybe I’ll revisit it in the future.
Prometheus
Prometheus runs on the Raspberry Pi and scrapes metrics from all my machines and services. It also evaluates alerting rules and fires alerts via Alertmanager.
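To give a concrete idea, a scrape job in prometheus.yml looks roughly like this; the job name, target address, and interval below are placeholders, not my actual config:

```yaml
scrape_configs:
  # Hypothetical job scraping a node_exporter instance every 15s
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['XX.XX.XX.XX:9100']
```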
Alloy
Alloy acts as a gateway between clients and both Prometheus and Loki.
On Kubernetes
On Kubernetes, Alloy collects logs from all pods using Kubernetes service discovery, then forwards them to Loki.
I deployed Alloy on Kubernetes via Helm, using the official Grafana chart:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: grafana-alloy
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      chart: alloy
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
      version: "1.2.0"
The rest of Alloy’s configuration lives in a ConfigMap. The first step is setting up pod discovery and relabeling; this lets me filter logs by namespace, app, container, or pod name in Loki:
discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
    action        = "replace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_label_app_kubernetes_io_name"]
    target_label  = "app"
    action        = "replace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
    action        = "replace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
    action        = "replace"
  }
}
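With these labels attached, logs can be sliced in Loki with LogQL queries such as the following (the namespace and app values are illustrative):

```logql
{namespace="monitoring", app="alloy"} |= "error"
```

This selects log lines containing "error" from any pod of a hypothetical `alloy` app in the `monitoring` namespace.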
Then all discovered logs are sent to Loki:
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods.output
  forward_to = [loki.process.process.receiver]
}

loki.process "process" {
  forward_to = [loki.write.loki.receiver]

  stage.drop {
    older_than          = "1h"
    drop_counter_reason = "too old"
  }

  stage.match {
    selector = "{instance=~\".*\"}"

    stage.json {
      expressions = {
        level = "\"level\"",
      }
    }

    stage.labels {
      values = {
        level = "level",
      }
    }
  }

  stage.label_drop {
    values = [ "service_name" ]
  }
}

loki.write "loki" {
  endpoint {
    url = "https://XX.XX.XX.XX:3100/loki/api/v1/push"
  }
}
Troubleshooting: Too Many Open Files
I ran into an issue where Alloy processes were opening too many files simultaneously, causing log collection errors.
The fix required two changes.
First, raising the ulimit via an init container:
initContainers:
  - name: raise-ulimit
    image: busybox
    command: ["sh", "-c", "ulimit -n 65536"]
Second, adding sysctl rules to raise the inotify limits in the NixOS nodes’ configuration:
boot.kernel.sysctl = {
  "fs.inotify.max_user_watches" = "2099999999";
  "fs.inotify.max_user_instances" = "2099999999";
  "fs.inotify.max_queued_events" = "2099999999";
};
These two changes resolved all log-scraping errors on Kubernetes.
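To confirm the new limit actually took effect, you can print the soft open-file limit from a shell inside the Alloy container (for example via kubectl exec); the check itself is just:

```shell
# Print the soft limit on open file descriptors for the current shell;
# after the fix, this should report the raised value (e.g. 65536).
ulimit -Sn
```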
On Virtual Machines and Physical Hosts
VMs and physical machines also run Alloy, this time to ship journald logs to Loki:
// Relabel journal logs to extract the systemd unit
loki.relabel "journal" {
  forward_to = []

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }
}

// Read logs from journald
loki.source.journal "read" {
  forward_to    = [loki.write.endpoint.receiver]
  relabel_rules = loki.relabel.journal.rules
  labels        = {
    component = "loki.source.journal",
    host      = "nixos-builder",
  }
}

// Forward logs to Loki
loki.write "endpoint" {
  endpoint {
    url = "https://XX.XX.XX.XX:3100/loki/api/v1/push"
  }
}
This configuration is baked into my NixOS template, so every NixOS machine I deploy gets it by default.
On the Raspberry Pi - Syslog Ingestion
The Pi’s Alloy instance also receives Syslog messages and forwards them to Loki. This is useful for network devices and other equipment that doesn’t support Alloy natively. Alloy acts as a Syslog-to-JSON converter for Loki:
otelcol.receiver.syslog "default" {
  protocol = "rfc5424"
  tcp {
    listen_address = "localhost:1515"
  }
  output {
    logs = [otelcol.exporter.syslog.default.input]
  }
}

otelcol.exporter.syslog "default" {
  endpoint              = "localhost"
  network               = "tcp"
  port                  = 1514
  protocol              = "rfc5424"
  enable_octet_counting = false
}

loki.source.syslog "default" {
  listener {
    address               = "localhost:1514"
    protocol              = "tcp"
    syslog_format         = "rfc5424"
    label_structured_data = true
    use_rfc5424_message   = true
  }
  forward_to = [loki.write.default.receiver]
}
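For reference, an RFC 5424 message on the wire looks like this (slightly simplified from the RFC’s own examples); this is the format the listener above expects:

```text
<34>1 2003-10-11T22:14:15.003Z mymachine.example.com su - ID47 - 'su root' failed for lonvick on /dev/pts/8
```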
Loki
Loki is designed to run as a cluster, but since I only have a single node, I configured it in single-node mode. All Alloy clients send logs directly to it.
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
  metric_aggregation_enabled: true
  volume_enabled: true

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf
Once Loki is configured, logs are visible in Grafana:

Blackbox Exporter
Blackbox is a prober that can check targets over HTTP, HTTPS, TCP, ICMP, and gRPC. I use it to monitor the availability of my services and get alerts when they go down.
Docker Deployment
blackbox:
  image: quay.io/prometheus/blackbox-exporter:latest
  command: --config.file=/config/blackbox.yml --config.enable-auto-reload
  ports:
    - "9115:9115"
  volumes:
    - ./blackbox:/config
Blackbox Configuration
Blackbox has its own config file that defines how it probes targets:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      headers:
        Host: vhost.example.com
        Origin: example.com
      http_headers:
        Accept-Language:
          values:
            - "en-US"
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_body_matches_regexp:
        - "Could not connect to database"
      fail_if_body_not_matches_regexp:
        - "Download the latest version here"
      fail_if_header_matches:  # Verifies that no cookies are set
        - header: Set-Cookie
          allow_missing: true
          regexp: '.*'
      fail_if_header_not_matches:
        - header: Access-Control-Allow-Origin
          regexp: '(\*|example\.com)'
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4"  # defaults to "ip6"
      ip_protocol_fallback: false   # no fallback to "ip6"
Prometheus Job
A Prometheus job is needed to scrape Blackbox results:
- job_name: 'blackbox'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - http://nixos.org
        - pdf.ridercorp.org
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox:9115  # Relabel to the actual Blackbox exporter address
- job_name: 'blackbox_exporter'  # Monitor the Blackbox exporter itself
  static_configs:
    - targets: ['blackbox:9115']
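With probes flowing, an alert rule on the standard `probe_success` metric turns failed probes into notifications. A sketch (the rule name, `for` duration, and severity are my own choices, not from this setup):

```yaml
groups:
  - name: blackbox
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} failed its blackbox probe"
```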
Load a Grafana dashboard and you can see the status of all your targets at a glance:

NUT - Network UPS Tools
What is NUT?
NUT (Network UPS Tools) is an open-source project for monitoring and controlling UPS devices over USB, SNMP, and many other protocols. It can also act as a UPS server, allowing connected machines to gracefully shut down during a power outage.
How I Use It
I use NUT primarily with Prometheus to receive an alert whenever my UPS switches to battery power. This is essentially my only reliable indicator of a power outage at home.
NUT on the Raspberry Pi
My board connects to UPS devices via USB, so NUT uses the usbhid-ups driver.
Install NUT with:
sudo apt update && sudo apt install nut nut-client nut-server
Then configure /etc/nut/nut.conf to expose the daemon on the network:
MODE=netserver
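The UPS itself is declared in /etc/nut/ups.conf. A minimal entry for a USB UPS looks like this (the section name is illustrative, reusing the `elipse1600` label from the Prometheus jobs below):

```ini
[elipse1600]
    driver = usbhid-ups
    port = auto
```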
Since everything runs in Docker, I just add the NUT exporter container:
nut-exporter:
  image: ghcr.io/druggeri/nut_exporter:3.2.2
  restart: unless-stopped
  network_mode: "host"
  ports:
    - "9199:9199"
And add the Prometheus scrape job:
- job_name: "ups-1"
  metrics_path: /ups_metrics
  static_configs:
    - targets: ["XX.XX.XX.XX:9199"]
      labels:
        ups: "elipse1600"
  params:
    ups: ["elipse1600"]
- job_name: "ups-2"
  metrics_path: /ups_metrics
  static_configs:
    - targets: ["XX.XX.XX.XX:9199"]
      labels:
        ups: "eaton3s"
  params:
    ups: ["eaton3s"]
Load a nice dashboard and the result speaks for itself:

Alertmanager
Alertmanager handles routing and delivering alerts. It can fan out to multiple platforms simultaneously.
Future improvement: I’d like to set up high-availability Alertmanager in my OOB network using the gossip protocol. If you have recommendations on how to do this, feel free to reach out!
Docker Configuration
alertmanager:
  image: prom/alertmanager:v0.31.0
  restart: unless-stopped
  command: --config.file=/data/alertmanager.yml
  volumes:
    - ./alertmanager:/data
  ports:
    - "9093:9093"
Alertmanager Config
Prometheus is responsible for evaluating rules and pushing fired alerts to Alertmanager:
global:
  resolve_timeout: 5m

route:
  group_wait: 10s
  group_interval: 30s
  repeat_interval: 1h
  receiver: 'discord-receiver'

receivers:
  - name: 'discord-receiver'
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/XXX/XXXXX/"
I’m currently using Discord for notifications, though I’m looking for a more purpose-built alerting platform with proper webhook support.
Connecting Prometheus to Alertmanager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
Alert Rules
I organize rules into one file per topic:
rule_files:
- /etc/prometheus/ups-rules.yml
My first alert fires immediately when a UPS goes offline (i.e., switches to battery):
groups:
  - name: ups
    rules:
      - alert: UPSNotOnline
        expr: network_ups_tools_ups_status{flag="OL"} == 0
        for: 0s
        labels:
          severity: critical
        annotations:
          summary: "UPS {{ $labels.ups }} is not online"
The for: 0s ensures the alert fires instantly — no delay tolerated for power events.
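A natural companion rule would warn before the battery runs out. This is only a sketch, assuming the exporter exposes battery.charge as `network_ups_tools_battery_charge` (consistent with the metric naming above); the 50% threshold is arbitrary:

```yaml
- alert: UPSBatteryLow
  expr: network_ups_tools_battery_charge < 50
  for: 0s
  labels:
    severity: critical
  annotations:
    summary: "UPS {{ $labels.ups }} battery below 50%"
```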
And here’s what it looks like in Discord:

Conclusion
There’s still a lot for me to learn about monitoring. This is a first pass at covering the critical parts of my infrastructure. The main areas I want to improve:
- Better Kubernetes observability — more optimal integration of the metrics Kubernetes exposes, and clearer targeting of what actually matters.
- Automated alert creation — a more systematic approach rather than writing rules by hand.
- Prometheus + Netbox integration — linking my monitoring stack to my network inventory could open up some interesting automation possibilities.
These are the directions I see for evolving this setup in the future.