DNS is the backbone of your network. When your Pi-hole goes down, every device in your home loses internet access. Websites won’t load. Apps stop working. Smart home devices go offline. It’s a single point of failure that brings everything to a halt.
This tutorial shows you how to build a resilient DNS infrastructure using two Pi-hole servers with automatic failover. If one server dies, the other seamlessly takes over in under 15 seconds — without any manual intervention.
What you’ll build:
```
                  Client Devices
                        ↓
                Firewall DNS Proxy
                        ↓
              Pi-hole VIP (<DNS_VIP>)
                 ↙              ↘
       DNS-Primary            DNS-Secondary
    <DNS_PRIMARY_IP>        <DNS_SECONDARY_IP>
      Priority: 200           Priority: 100
           ↓                       ↓
      Health Check            Health Check
     (every 5 sec)           (every 5 sec)
           ↓                       ↓
    VIP ACTIVE when          Takes VIP when
    healthy (MASTER)         Primary fails
```
Key features:

- **Automatic failover** - VIP moves to the backup server when the primary fails
- **Health monitoring** - Checks the FTL service, port 53, and DNS resolution
- **Discord notifications** - Real-time alerts on state changes
- **Staggered updates** - Never update both servers simultaneously
- **Zero client configuration** - Clients always use the same VIP address
The high-availability setup uses VRRP (Virtual Router Redundancy Protocol) to manage a floating IP address that automatically moves between servers based on health status.
File: /etc/keepalived/check_pihole.sh (deploy to both servers)

```bash
#!/bin/bash
set -euo pipefail
#
# Pi-hole Health Check Script for Keepalived
# Verifies that Pi-hole FTL service is running and responding to DNS queries
#
# Exit codes:
#   0 = Pi-hole is healthy (keepalived continues as MASTER/BACKUP)
#   1 = Pi-hole is unhealthy (keepalived reduces priority)
#

# Check if pihole-FTL service is active
if ! systemctl is-active --quiet pihole-FTL; then
    logger -t keepalived-healthcheck "FAILED: pihole-FTL service is not running"
    exit 1
fi

# Check if FTL is listening on port 53 (DNS)
if ! ss -tulpn | grep -q ':53.*pihole-FTL'; then
    logger -t keepalived-healthcheck "FAILED: pihole-FTL not listening on port 53"
    exit 1
fi

# Perform actual DNS query test using dig
# Query DNS infrastructure domains to verify DNS resolution is working
# Uses redundant targets to avoid false positives from single-site outages
if command -v dig &> /dev/null; then
    # Try Google's DNS infrastructure first (dns.google = 8.8.8.8)
    if ! timeout 1.0 dig @127.0.0.1 +short +tries=1 +time=1 dns.google &> /dev/null; then
        # First query failed, try Cloudflare as fallback (one.one.one.one = 1.1.1.1)
        if ! timeout 1.0 dig @127.0.0.1 +short +tries=1 +time=1 one.one.one.one &> /dev/null; then
            logger -t keepalived-healthcheck "FAILED: DNS query test failed (both targets unreachable)"
            exit 1
        fi
    fi
else
    # Fallback to nc if dig is not available
    if ! echo -e "q\n" | timeout 1.0 nc -u 127.0.0.1 53 &> /dev/null; then
        logger -t keepalived-healthcheck "FAILED: Port 53 connectivity test failed"
        exit 1
    fi
fi

# All checks passed
exit 0
```
Deploy and set permissions:
```bash
# Copy to both servers
scp check_pihole.sh user@<DNS_PRIMARY_IP>:/tmp/
scp check_pihole.sh user@<DNS_SECONDARY_IP>:/tmp/

# On each server, move to final location
ssh user@<DNS_PRIMARY_IP> "sudo mv /tmp/check_pihole.sh /etc/keepalived/ && \
  sudo chmod 750 /etc/keepalived/check_pihole.sh && \
  sudo chown root:root /etc/keepalived/check_pihole.sh"
ssh user@<DNS_SECONDARY_IP> "sudo mv /tmp/check_pihole.sh /etc/keepalived/ && \
  sudo chmod 750 /etc/keepalived/check_pihole.sh && \
  sudo chown root:root /etc/keepalived/check_pihole.sh"
```
Why redundant DNS queries?
Using multiple targets (dns.google + one.one.one.one) prevents false positives. If Google’s DNS is briefly unreachable, Cloudflare provides a fallback. Both have 99.99%+ uptime SLAs. The health check only fails if both queries fail, indicating a real Pi-hole problem.
File: keepalived-primary.conf (DNS-Primary)

```
global_defs {
    router_id DNS_PRIMARY
    enable_script_security
    script_user root
}

vrrp_script check_pihole {
    script "/etc/keepalived/check_pihole.sh"
    interval 5      # Run every 5 seconds
    timeout 2       # Script must complete within 2 seconds
    fall 3          # Require 3 consecutive failures before declaring unhealthy
    rise 2          # Require 2 consecutive successes before declaring healthy
    weight -150     # Reduce priority by 150 when unhealthy
}

vrrp_instance DNS_HA {
    state MASTER                # Initial state
    interface eth0              # Network interface
    virtual_router_id 55        # Must match on both servers
    priority 200                # Higher = preferred MASTER
    advert_int 1                # Advertise every 1 second

    authentication {
        auth_type PASS
        auth_pass YourSecretPassword    # Change this! Must match on both servers
    }

    unicast_src_ip <DNS_PRIMARY_IP>     # This server's IP
    unicast_peer {
        <DNS_SECONDARY_IP>              # Peer server's IP
    }

    virtual_ipaddress {
        <DNS_VIP>/24                    # Virtual IP with subnet mask
    }

    track_script {
        check_pihole
    }

    notify "/etc/keepalived/keepalived-discord-notify.sh"
}
```
Key parameters explained:

| Parameter | Value | Why |
|---|---|---|
| `priority` | 200 | Primary's base priority (higher than secondary's 100) |
| `weight` | -150 | Drops priority to 50 when unhealthy (below secondary's 100) |
| `fall` | 3 | Requires 15 seconds of failures (3 × 5 sec) before failover |
| `rise` | 2 | Requires 10 seconds of health (2 × 5 sec) before recovery |
| `virtual_router_id` | 55 | Arbitrary number, must match on both servers |
Critical: The weight value must drop primary’s priority below secondary’s priority to trigger failover. If weight is too small (e.g., -50), the calculation becomes 200 - 50 = 150, which is still higher than secondary’s 100, preventing failover.
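The arithmetic is worth sanity-checking. This short shell sketch (variable names are illustrative; the values come from the configs in this tutorial) reproduces keepalived's effective-priority calculation:

```shell
# keepalived subtracts |weight| from the configured priority
# while the tracked script is failing.
PRIMARY_PRIORITY=200
SECONDARY_PRIORITY=100
WEIGHT=-150

EFFECTIVE=$((PRIMARY_PRIORITY + WEIGHT))   # 200 + (-150) = 50
echo "Primary effective priority while unhealthy: $EFFECTIVE"

if [ "$EFFECTIVE" -lt "$SECONDARY_PRIORITY" ]; then
    echo "FAILOVER: secondary ($SECONDARY_PRIORITY) now outranks primary ($EFFECTIVE)"
else
    echo "NO FAILOVER: primary still outranks secondary"
fi

# Contrast with a weight that is too small:
TOO_SMALL=$((PRIMARY_PRIORITY - 50))       # 150, still above 100, so no failover
echo "With weight -50, the unhealthy primary would sit at $TOO_SMALL"
```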
File: keepalived-secondary.conf (DNS-Secondary)

```
global_defs {
    router_id DNS_SECONDARY     # Changed: Different router ID
    enable_script_security
    script_user root
}

vrrp_script check_pihole {
    script "/etc/keepalived/check_pihole.sh"
    interval 5
    timeout 2
    fall 3
    rise 2
    weight -150
}

vrrp_instance DNS_HA {
    state BACKUP                # Changed: Initial state BACKUP
    interface eth0
    virtual_router_id 55
    priority 100                # Changed: Lower priority
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass YourSecretPassword    # Same as primary
    }

    unicast_src_ip <DNS_SECONDARY_IP>   # Changed: This server's IP
    unicast_peer {
        <DNS_PRIMARY_IP>                # Changed: Primary's IP
    }

    virtual_ipaddress {
        <DNS_VIP>/24
    }

    track_script {
        check_pihole
    }

    notify "/etc/keepalived/keepalived-discord-notify.sh"
}
```
Deploy configurations:
```bash
# Copy configs to servers
scp keepalived-primary.conf user@<DNS_PRIMARY_IP>:/tmp/keepalived.conf
scp keepalived-secondary.conf user@<DNS_SECONDARY_IP>:/tmp/keepalived.conf

# Move to final location on each server
ssh user@<DNS_PRIMARY_IP> "sudo mv /tmp/keepalived.conf /etc/keepalived/keepalived.conf && \
  sudo chmod 644 /etc/keepalived/keepalived.conf"
ssh user@<DNS_SECONDARY_IP> "sudo mv /tmp/keepalived.conf /etc/keepalived/keepalived.conf && \
  sudo chmod 644 /etc/keepalived/keepalived.conf"
```
Validate syntax before starting:
```bash
# On primary
ssh user@<DNS_PRIMARY_IP> "sudo keepalived -t -f /etc/keepalived/keepalived.conf"

# On secondary
ssh user@<DNS_SECONDARY_IP> "sudo keepalived -t -f /etc/keepalived/keepalived.conf"

# Both should output: "Configuration is using : 0 Bytes"
```
Start keepalived on both servers (`sudo systemctl enable --now keepalived`), then verify where the VIP landed:

```bash
# On primary - should show VIP active
ssh user@<DNS_PRIMARY_IP> "ip addr show eth0 | grep <DNS_VIP>"

# On secondary - should NOT show VIP
ssh user@<DNS_SECONDARY_IP> "ip addr show eth0 | grep <DNS_VIP>"

# Check keepalived logs
ssh user@<DNS_PRIMARY_IP> "sudo journalctl -u keepalived --since '5 minutes ago' --no-pager"
```
You should see Discord notifications for both servers: DNS-Secondary entering BACKUP state, DNS-Primary entering MASTER state.
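Before trusting the pair with your network, force a failover and watch it happen. The drill below is a sketch of a manual test; run the query loop from any client machine while you stop the service on the primary:

```bash
# On DNS-Primary: simulate a Pi-hole failure
ssh user@<DNS_PRIMARY_IP> "sudo systemctl stop pihole-FTL"

# From a client: query the VIP once a second and watch for the outage window
while true; do
    date +%T
    dig @<DNS_VIP> +short +time=1 +tries=1 example.com || echo "TIMEOUT"
    sleep 1
done

# After roughly 15 seconds, the VIP should appear on the secondary:
ssh user@<DNS_SECONDARY_IP> "ip addr show eth0 | grep <DNS_VIP>"

# Restore the primary when done
ssh user@<DNS_PRIMARY_IP> "sudo systemctl start pihole-FTL"
```

You should see a short burst of timeouts, then queries resume once the secondary holds the VIP, along with the corresponding Discord alerts.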
Critical: Never update both Pi-hole servers simultaneously. If an update breaks DNS, you’ll have zero redundancy. Stagger updates by at least 3 days.
File: /etc/crontab (add to each server)
```
# On DNS-Primary - 1st of each month at 3:00 AM
0 3 1 * * root /usr/local/bin/pihole-auto-update.sh

# On DNS-Secondary - 4th of each month at 3:00 AM
0 3 4 * * root /usr/local/bin/pihole-auto-update.sh
```
Create update script:
File: /usr/local/bin/pihole-auto-update.sh (deploy to both servers)
```bash
#!/bin/bash
set -euo pipefail
#
# Pi-hole Automatic Update Script
# Runs monthly updates and logs the results
#

LOGFILE="/var/log/pihole-auto-update.log"
HOSTNAME=$(hostname)
TIMESTAMP=$(date "+%Y-%m-%d %H:%M:%S")

echo "========================================" | tee -a "$LOGFILE"
echo "Pi-hole Auto-Update Started" | tee -a "$LOGFILE"
echo "Server: $HOSTNAME" | tee -a "$LOGFILE"
echo "Time: $TIMESTAMP" | tee -a "$LOGFILE"
echo "========================================" | tee -a "$LOGFILE"
echo "" | tee -a "$LOGFILE"

# Run Pi-hole update
# Temporarily disable errexit so a failed update is logged instead of
# silently aborting the script (pipefail would otherwise trip `set -e`)
echo "Running: pihole -up" | tee -a "$LOGFILE"
set +e
pihole -up 2>&1 | tee -a "$LOGFILE"
UPDATE_EXIT_CODE=${PIPESTATUS[0]}
set -e

TIMESTAMP_END=$(date "+%Y-%m-%d %H:%M:%S")
echo "" | tee -a "$LOGFILE"
echo "========================================" | tee -a "$LOGFILE"
echo "Pi-hole Auto-Update Completed" | tee -a "$LOGFILE"
echo "End Time: $TIMESTAMP_END" | tee -a "$LOGFILE"
echo "Exit Code: $UPDATE_EXIT_CODE" | tee -a "$LOGFILE"
echo "========================================" | tee -a "$LOGFILE"

# Log to syslog
if [ "$UPDATE_EXIT_CODE" -eq 0 ]; then
    logger -t pihole-auto-update "[$HOSTNAME] Pi-hole update completed successfully"
else
    logger -t pihole-auto-update "[$HOSTNAME] Pi-hole update failed with exit code $UPDATE_EXIT_CODE"
fi

exit "$UPDATE_EXIT_CODE"
```
If the update on DNS-Primary causes issues, you have 3 days to detect and fix the problem before DNS-Secondary updates to the same broken version. This window gives you time to monitor logs, check for regressions, and roll back if needed.
```bash
# Should show VIP on primary
ssh user@<DNS_PRIMARY_IP> "ip addr show eth0 | grep <DNS_VIP>"

# Should show nothing on secondary
ssh user@<DNS_SECONDARY_IP> "ip addr show eth0 | grep <DNS_VIP>"
```
Mistake: DNS-Primary used DNS-Secondary for DNS, DNS-Secondary used DNS-Primary for DNS (circular dependency).
Problem: When one server rebooted, the other couldn’t resolve DNS during gravity updates → 12+ minute failures → cascading outage.
Fix: Both servers now use gateway/firewall (which has static DNS entries) as primary DNS, Cloudflare 1.1.1.2 as fallback. Neither depends on the other.
Lesson: HA pairs should never depend on each other for critical services. Use external/upstream dependencies.
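Concretely, the fix can look like this on both Pi-hole hosts. This is a sketch of a static resolver config; `<GATEWAY_IP>` is a placeholder for your firewall's address, and on distros using systemd-resolved or NetworkManager you would set these values through that tool instead of editing the file directly:

```
# /etc/resolv.conf on BOTH DNS-Primary and DNS-Secondary
# Neither server points at the other - upstream dependencies only
nameserver <GATEWAY_IP>   # gateway/firewall with static DNS entries
nameserver 1.1.1.2        # Cloudflare fallback
```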
Failover takes 15 seconds (3 consecutive health check failures). During that window, DNS queries time out. Design for resilience, not perfection. 15 seconds of DNS unavailability is acceptable when it prevents hours of manual recovery.
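The 15-second figure falls straight out of the health-check settings. A quick sanity check (the hard-crash estimate is an approximation that omits VRRP's small priority-based skew term):

```shell
# Failover timing for the settings used in this tutorial
INTERVAL=5       # health check interval (seconds)
FALL=3           # consecutive failures required
ADVERT_INT=1     # VRRP advertisement interval (seconds)

DETECT=$((INTERVAL * FALL))       # unhealthy-service failover: 3 checks x 5s
MASTER_DOWN=$((3 * ADVERT_INT))   # hard crash: backup stops hearing adverts
echo "Priority-drop failover: ~${DETECT}s"
echo "Hard-crash failover: ~${MASTER_DOWN}s"
```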
Without Discord notifications, I wouldn’t know failover occurred until I checked logs. Real-time alerts turn HA from “set and forget” to “observe and optimize.”
Initial health check settings (0.5s timeout, 2-failure threshold) caused flapping. Production data showed that 1.0s timeout with 3-failure threshold eliminated false positives. Trust your metrics, not your assumptions.
Updates, backups, maintenance windows — never do both servers simultaneously. The 3-day update stagger saved me when a Pi-hole update introduced a regression that broke DNSSEC. Primary got the broken version, I caught it in logs, held back secondary’s update.
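Cron dates enforce the stagger passively; a guard at the top of the update script can enforce it mechanically. This is a hedged sketch: `/var/log/peer_last_update` is a hypothetical marker file you would have each server write (and sync to its peer) after a successful `pihole -up`, and GNU `date` is assumed:

```shell
# Skip our own update if the peer updated too recently
STAGGER_DAYS=3
MARKER="/var/log/peer_last_update"   # hypothetical: synced from the peer

days_since() {
    # Days elapsed since an ISO date like 2024-01-15 (GNU date assumed)
    local then now
    then=$(date -d "$1" +%s)
    now=$(date +%s)
    echo $(( (now - then) / 86400 ))
}

if [ -f "$MARKER" ]; then
    elapsed=$(days_since "$(cat "$MARKER")")
    if [ "$elapsed" -lt "$STAGGER_DAYS" ]; then
        echo "Peer updated ${elapsed} day(s) ago - skipping update"
        exit 0
    fi
fi
echo "Stagger window satisfied - safe to update"
```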
I considered adding Consul for health checks, Prometheus for metrics, Ansible for config management. Instead: keepalived (simple, battle-tested, built-in health checks), bash scripts (readable, auditable), rsyslog (ubiquitous). Fewer dependencies = fewer failure modes.
Six months from now, you won’t remember why you set weight -150 instead of -100. Document the math. Explain the rationale. Your future self will thank you.