This advisory discusses a shift in monitoring strategy for Grafana and Prometheus setups, focusing on selective metric monitoring as the key to effective system management. When first setting up a homelab with Grafana + Prometheus, it is tempting to monitor every available metric (CPU, RAM, disk usage, network activity, etc.), which overwhelms dashboards without providing actionable insights. The author argues that only three alert conditions are needed to maintain system health and operational integrity: disk usage above 85%, unexpected container stops, and backup jobs that have not completed on schedule. These cover the critical failure modes of storage saturation, service disruption, and data-protection failure, ensuring timely alerts that prevent downtime or data loss.
- Grafana with Prometheus integration
- ZFS storage systems
- LXC containers
- Configure a Grafana alert rule for disk usage above 85% using node_exporter filesystem metrics, e.g. the PromQL expression `100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100) > 85`, created in Grafana's Alerting section.
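The threshold logic behind that rule can be sketched locally. This is an illustrative stand-in for the PromQL expression, not Grafana code; the path and threshold defaults are assumptions:

```python
import shutil

# Hypothetical local sketch of the >85% disk-usage alert condition.
def disk_usage_percent(path="/"):
    """Return used space on the filesystem at `path` as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def disk_alert(path="/", threshold=85.0):
    """True when usage exceeds the alert threshold (85% by default)."""
    return disk_usage_percent(path) > threshold
```

The 85% threshold matters on ZFS in particular, since pool performance degrades as free space runs low.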
- Set up Uptime Kuma TCP checks to monitor container status on their primary service ports: run `docker run -d --name uptime-kuma -p 3001:3001 louislam/uptime-kuma`, then configure the checks in the Uptime Kuma dashboard.
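Under the hood, a TCP check of this kind is just a connection attempt with a timeout. A minimal sketch of the same pass/fail signal in Python (the host, port, and timeout values are assumptions, not Uptime Kuma internals):

```python
import socket

# Sketch of a TCP reachability check: try to open a connection to
# host:port and report success or failure.
def tcp_check(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Pointing a check like this at each container's primary service port catches the "container stopped unexpectedly" case even when the container runtime itself still reports healthy.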
- Ensure nightly PBS (Proxmox Backup Server) backup jobs are configured with a retention policy, and add a Grafana alert rule that fires when the most recent backup is older than 26 hours.
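The 26-hour freshness rule reduces to a timestamp comparison. A hedged sketch of that logic (the function name and the UTC assumption are mine, not from the advisory):

```python
from datetime import datetime, timedelta, timezone

# Sketch of the backup-freshness rule: alert when the most recent
# successful backup is more than 26 hours old. The 26-hour window
# gives a nightly job two hours of slack before alerting.
MAX_AGE = timedelta(hours=26)

def backup_stale(last_backup, now=None):
    """True when the last successful backup exceeds MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup > MAX_AGE
```

In practice the `last_backup` timestamp would come from a Prometheus metric exported by the backup job, with the comparison done in the alert rule rather than application code.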
This advisory applies to homelab stacks using Grafana + Prometheus for monitoring, particularly those running ZFS storage and LXC containers. Its impact is on the effectiveness of the monitoring configuration: it cuts alert noise by replacing broad metric collection with a small set of actionable alerts.