

# Observability Stack

Visibility into 50+ services requires centralized logging, proactive alerting, and dashboards. This wiki covers my monitoring stack and the patterns that make it work.

## Monitoring Stack

### Graylog Centralized Logging

Graylog is my log aggregation platform—collecting, processing, and visualizing logs from across the homelab.

#### Architecture

| Component | Purpose | Resources |
|---|---|---|
| Graylog | Web UI + log ingestion | 1 GB JVM heap |
| OpenSearch | Log storage + full-text search | 1 GB JVM heap |
| MongoDB | Configuration metadata | ~200 MB |

Storage: Dedicated 240 GB disk for OpenSearch indices.

Key insight: JVM heap tuning matters. Leaving ~2 GB free for OS filesystem cache dramatically improves OpenSearch query performance.
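
For orientation, the three components map onto a compose file roughly like this (image tags, versions, and settings are illustrative, not the actual deployment):

```yaml
services:
  mongodb:
    image: mongo:6            # configuration metadata
  opensearch:
    image: opensearchproject/opensearch:2
    environment:
      discovery.type: single-node
      OPENSEARCH_JAVA_OPTS: "-Xms1g -Xmx1g"    # pinned heap
  graylog:
    image: graylog/graylog:6.1
    environment:
      GRAYLOG_SERVER_JAVA_OPTS: "-Xms1g -Xmx1g"
    depends_on:
      - mongodb
      - opensearch
    ports:
      - "9000:9000"        # web UI + REST API
      - "1514:1514"        # syslog TCP
      - "1514:1514/udp"    # syslog UDP
      - "12201:12201/udp"  # GELF UDP
```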

#### Log Shipping Methods

| Transport | Port | Use Case | Example |
|---|---|---|---|
| Syslog TCP | 1514 | Appliance native syslog | Firewall traffic logs |
| Syslog UDP | 1514 | rsyslog forwarding | Pi-hole DNS, NAS |
| GELF UDP | 12201 | Docker container logs | Caddy, other stacks |

#### GELF Docker Pattern

Docker services ship logs directly to Graylog:

```yaml
services:
  myservice:
    logging:
      driver: gelf
      options:
        gelf-address: "udp://<GRAYLOG_IP>:12201"
        tag: "service-name"
```

Benefits:

- Structured log fields (container name, image, etc.)
- No log rotation management
- `docker logs` still works (Docker dual logging)

Gotcha: Changing the logging driver requires recreating the container (`docker compose down && docker compose up -d`); a plain `docker compose restart` is not enough.

#### rsyslog Forwarding Pattern


For services without native Graylog support, rsyslog forwards logs:

*(Diagram: rsyslog forwarding flow)*

Example: Pi-hole DNS logs → rsyslog → Graylog → Dashboard.
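
A minimal rsyslog drop-in for this pattern might look like the following (file name and selector are illustrative, not the actual config):

```
# /etc/rsyslog.d/90-graylog.conf
# Forward all facilities/severities to Graylog over UDP 1514.
# A double @@ would forward over TCP instead.
*.* @<GRAYLOG_IP>:1514
```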

#### Active Pipelines

Each log source has a processing pipeline that extracts structured fields:

| Stream | Source | Field Prefix | Dashboards |
|---|---|---|---|
| Pi-hole DNS | DNS servers | `dns_` | Query analysis, blocked domains |
| PAN-OS Firewall | Main firewall | `fw_` | Traffic, threats, blocked activity |
| Synology NAS | NAS devices | `syn_` | Storage, access logs |
| Caddy Proxy | Reverse proxy | `caddy_` | Request analysis |
| Internet Modem | Upstream modem | `modem_` | Connection logs |

#### Pipeline Design Pattern

All pipelines use a single-stage pattern:

*(Diagram: Graylog pipeline design pattern)*

Why single stage? Graylog has a bug where `match either` in stage 0 prevents stage 1 from executing when no stage-0 rules match. A single stage with content-based exclusions avoids this.
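
As an illustration of the single-stage, exclusion-based style (rule name, stream, and field names are hypothetical, not the actual rules):

```
rule "pihole: extract DNS fields"
when
  // Stage 0 only. Exclude noise by message content instead of
  // deferring filtering to a later stage.
  has_field("source") &&
  to_string($message.source) == "pihole" &&
  ! contains(to_string($message.message), "rate-limiting")
then
  set_field("dns_query_source", to_string($message.source));
end
```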

#### Index Retention

All index sets use standardized retention:

| Setting | Value |
|---|---|
| Min Lifetime | 30 days |
| Max Lifetime | 90 days |
| Strategy | Time-based with size optimization |
| Deletion | Automatic after max lifetime |

Rationale: 30-90 days covers most troubleshooting needs. Older logs are rarely needed; if they are, PBS backups hold the original sources.

## Dashboards

### Graylog Dashboards

13 dashboards organized by function:

| Category | Dashboards | Purpose |
|---|---|---|
| DNS | Pi-hole Overview, DNS Security, DNS Operations | Query analysis, blocked domains |
| Firewall | Traffic, Threats, Blocked, URLs, Network Activity | Security visibility |
| Infrastructure | Homelab Security Overview | Cross-service summary |
| Proxy | Caddy Overview | Request analysis |

Dashboard creation: Python scripts using the Graylog REST API create dashboards programmatically. This enables version control and reproducible deployments.
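
The actual scripts are Python; as a rough sketch of the same idea in shell (endpoint path, header, and payload shape are assumptions, not copied from those scripts), a payload can be built and POSTed like this:

```shell
# Build a minimal dashboard payload. Graylog 7 expects the body
# wrapped in {"entity": ...} (see the API gotchas under Lessons Learned).
title="Pi-hole Overview"
payload=$(printf '{"entity": {"title": "%s"}}' "$title")
echo "$payload"

# POST it (commented out; requires a live Graylog instance):
# curl -u admin:<PASSWORD> -H 'Content-Type: application/json' \
#      -H 'X-Requested-By: cli' -X POST \
#      "http://<GRAYLOG_IP>:9000/api/dashboards" -d "$payload"
```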

### Pulse (Proxmox Monitoring)

Pulse provides Proxmox-specific metrics:

| Metric | Visualization |
|---|---|
| CPU usage (per node) | Time series graph |
| RAM usage (per node) | Time series graph |
| Disk usage | Bar charts |
| Network I/O | Time series graph |
| VM/LXC status | Status cards |

Integration: Read-only Proxmox API user (`pulse@pve` with the `PVEAuditor` role).

## Uptime Monitoring

### Uptime Kuma

Two Uptime Kuma instances for redundant availability monitoring:

| Instance | Purpose |
|---|---|
| UTK-A | Primary monitoring |
| UTK-B | Secondary (also monitors UTK-A) |

Why two? If UTK-A goes down, UTK-B notices and alerts. Single instance = blind spot.

### Monitor Types

| Type | Use Case | Example |
|---|---|---|
| HTTP(S) | Web services | Graylog UI, Proxmox API |
| TCP Port | Raw connectivity | SSH, database ports |
| Ping | Basic availability | Network devices |
| DNS | Resolution check | Pi-hole health |

### Monitor Strategy

*(Diagram: uptime monitor strategy)*

## Alerting

### Discord Integration

All monitoring sends alerts to Discord:

*(Image: Discord alert embed)*

### Alert Sources

| Source | Alert Types |
|---|---|
| Uptime Kuma | Service down/up, certificate expiry |
| Graylog | Log-based alerts (high error rate, security events) |
| Keepalived | HA failover notifications |
| WUD | Container update availability |
| Backup scripts | Backup success/failure |

### Alert Fatigue Prevention

Strategies to avoid noise:

  1. Severity tiers: Only critical alerts wake me up
  2. Cooldown periods: 30-minute minimum between repeat alerts
  3. Flap detection: Ignore rapid up/down cycles (usually transient)
  4. Maintenance windows: Suppress alerts during planned work

## Network Discovery

### Pi.Alert

Scans subnets every 5 minutes for device discovery:

| Feature | Purpose |
|---|---|
| New device detection | Alert on unknown devices |
| MAC tracking | Identify device moves |
| Port scanning | Discover services |
| Vendor lookup | Identify device types |

Monitored subnets: Management VLAN, Server VLAN.

## Observability Patterns

### Log-Based Monitoring

```text
Application → Graylog → Stream → Alert Condition → Discord
```

Example: “More than 10 DNS query failures in 5 minutes” → Alert.

### Metric-Based Monitoring

```text
Service → Prometheus/API → Pulse/Dashboard → Threshold → Alert
```

Example: “Proxmox node CPU > 90% for 5 minutes” → Alert.

### Synthetic Monitoring

```text
Uptime Kuma → Scheduled Request → Service → Response Check → Alert
```

Example: “Graylog `/api/system/lbstatus` doesn’t return `ALIVE`” → Alert.

## Lessons Learned

### 1. JVM Heap Pinning

Graylog and OpenSearch both run on JVMs. Without explicit heap limits, they compete for RAM and starve the OS filesystem cache. Always pin JVM heap:

```yaml
GRAYLOG_SERVER_JAVA_OPTS: "-Xms1g -Xmx1g"
OPENSEARCH_JAVA_OPTS: "-Xms1g -Xmx1g"
```

### 2. Graylog 7 API Gotchas

Graylog 7 changed its REST API significantly:

| Issue | Solution |
|---|---|
| `entity cannot be null` | Wrap body in `{"entity": <payload>}` |
| Regex capture groups | 0-indexed: `m["0"]` = first group |
| Stream creation | Requires explicit `index_set_id` |
| Dashboard series format | Use `search_type` format, not widget format |

### 3. Log Retention Balance

Too short = missing data when you need it. Too long = disk full, slow queries.

30-90 days is the sweet spot for homelab scale.

### 4. Redundant Monitoring

The monitoring system itself needs monitoring. UTK-B watching UTK-A ensures no blind spots.

### 5. Pipeline Rules Only Apply at Ingestion

Graylog pipeline rules process messages at ingestion time only. Changing a rule doesn’t reprocess existing logs. For historical data, you must re-send the logs or query raw fields.

## Quick Reference

### Graylog API

```shell
# Health check
curl http://<GRAYLOG_IP>:9000/api/system/lbstatus

# List streams
curl -u admin:<PASSWORD> http://<GRAYLOG_IP>:9000/api/streams

# Search logs
curl -u admin:<PASSWORD> "http://<GRAYLOG_IP>:9000/api/views/search/messages?query=source:firewall"
```

### Log Shipping Test

```shell
# Test syslog UDP
echo "<14>Test message from $(hostname)" | nc -u <GRAYLOG_IP> 1514

# Test GELF
echo '{"short_message":"Test","host":"test"}' | nc -u <GRAYLOG_IP> 12201
```

## Related Pages

- **Architecture: Prometheus + Grafana on a Dedicated LXC** (2026): Migrated the Prometheus + Grafana monitoring stack from a shared Docker VM to a dedicated LXC container. The shared VM hosted multiple stacks (pgAdmin, Portainer, monitoring), which created resource contention and made lifecycle management messy. Moving monitoring to its own LXC follows the homelab pattern of one service per container for cleaner isolation, backups, and management.