This advisory discusses a shift in monitoring strategy for Grafana and Prometheus setups, focusing on selective metric monitoring as the key to effective system management. When first setting up a homelab with Grafana + Prometheus, it is tempting to monitor every available metric (CPU, RAM, disk usage, network activity, etc.), which overwhelms dashboards without providing actionable insights. The author argues that only three alert conditions are needed to maintain system health and operational integrity: disk usage above 85%, unexpected container stops, and backup jobs that have not completed on schedule. These cover the critical failure modes of storage saturation, service disruption, and data-protection failure, ensuring timely alerts that prevent downtime or data loss.
- Grafana with Prometheus integration
- ZFS storage systems
- LXC containers
- Configure a Grafana alert rule for disk usage above 85% using node_exporter filesystem metrics, e.g. the PromQL expression `100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 85`, created in Grafana's Alerting section.
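The threshold logic behind that rule can be sketched locally. This is an illustrative stand-in for the PromQL expression, not Grafana code; the path and threshold defaults are assumptions:

```python
import shutil

# Hypothetical local sketch of the >85% disk-usage alert condition.
def disk_usage_percent(path="/"):
    """Return used space on the filesystem at `path` as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def disk_alert(path="/", threshold=85.0):
    """True when usage exceeds the alert threshold (85% by default)."""
    return disk_usage_percent(path) > threshold
```

The 85% threshold matters on ZFS in particular, since pool performance degrades as free space runs low.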
- Set up Uptime Kuma TCP checks to monitor container status on their primary service ports: run `docker run -d --name uptime-kuma -p 3001:3001 louislam/uptime-kuma`, then configure the checks in the Uptime Kuma dashboard.
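Under the hood, a TCP check of this kind is just a connection attempt with a timeout. A minimal sketch of the same pass/fail signal in Python (the host, port, and timeout values are assumptions, not Uptime Kuma internals):

```python
import socket

# Sketch of a TCP reachability check: try to open a connection to
# host:port and report success or failure.
def tcp_check(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Pointing a check like this at each container's primary service port catches the "container stopped unexpectedly" case even when the container runtime itself still reports healthy.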
- Ensure nightly PBS (Proxmox Backup Server) backup jobs are configured with a retention policy, and add a Grafana alert rule that fires when the most recent backup is older than 26 hours.
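The 26-hour freshness rule reduces to a timestamp comparison. A hedged sketch of that logic (the function name and the UTC assumption are mine, not from the advisory):

```python
from datetime import datetime, timedelta, timezone

# Sketch of the backup-freshness rule: alert when the most recent
# successful backup is more than 26 hours old. The 26-hour window
# gives a nightly job two hours of slack before alerting.
MAX_AGE = timedelta(hours=26)

def backup_stale(last_backup, now=None):
    """True when the last successful backup exceeds MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup > MAX_AGE
```

In practice the `last_backup` timestamp would come from a Prometheus metric exported by the backup job, with the comparison done in the alert rule rather than application code.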
This advisory applies to homelab stacks using Grafana + Prometheus for monitoring, particularly those running ZFS storage and LXC containers. Its impact is on the effectiveness of the monitoring configuration: it cuts alert noise by replacing broad metric collection with a small set of actionable alerts.