
# Infrastructure Architecture

My homelab runs on a 4-node Proxmox VE cluster hosting 50+ LXC containers and VMs. This wiki documents the architecture, conventions, and lessons learned.

## Proxmox Cluster Architecture

### Cluster Specifications

| Node | Storage | Type | CPU | RAM | Primary Workloads |
|------|---------|------|-----|-----|-------------------|
| Node 2 | ssd-data | LVM-thin | 4 cores | 16 GB | PBS, Development |
| Node 3 | zdata | ZFS | 4 cores | 32 GB | Databases, DNS-Primary |
| Node 5 | ssd-data | LVM-thin | 4 cores | 16 GB | Graylog VM, DNS-Secondary |
| Node 6 | zdata | ZFS | 4 cores | 32 GB | Docker-Main, HA services |

Total Resources:

  • πŸ–₯️ 16 CPU cores available for VMs/LXCs
  • πŸ’Ύ 96 GB RAM across cluster
  • πŸ“€ ~2 TB combined storage (SSD + ZFS)
  • πŸ“¦ 50+ containers running

### Why Mixed Storage?

| Storage Type | Advantages | Best For |
|--------------|------------|----------|
| LVM-thin | Fast snapshots, thin provisioning, SSD optimized | General workloads, development |
| ZFS | Checksumming, compression, data integrity | Databases, critical data |

## VM/LXC ID Naming Convention

I use a deterministic ID scheme that encodes network location:

*Diagram: VM/LXC ID formula*

Benefits:

  • πŸ” Instantly know a container’s IP from its ID
  • βœ… Avoid IP conflicts during provisioning
  • πŸ“š Simplify documentation and troubleshooting
  • πŸ”’ Consistent across all 50+ containers
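The formula diagram above didn't survive the export, so here is a minimal sketch of the idea. The exact encoding is an assumption (this sketch uses `ID = VLAN Γ— 1000 + host octet`, so VLAN 20 and host `10.0.20.45` would map to ID 20045); the point is that the mapping is invertible in both directions.

```python
def vm_id(vlan: int, host_octet: int) -> int:
    """Derive a deterministic Proxmox VM/LXC ID from network location.

    Encoding is assumed: VLAN * 1000 + final IP octet.
    """
    return vlan * 1000 + host_octet


def ip_from_id(vmid: int, prefix: str = "10.0") -> str:
    """Recover a container's IP address from its ID (the 'instantly know
    the IP' benefit). The 10.0.0.0/16 prefix is a made-up example."""
    vlan, host = divmod(vmid, 1000)
    return f"{prefix}.{vlan}.{host}"
```

Because the scheme is a pure function of VLAN and host octet, any tool (or human) can compute it independently and get the same answer.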

## Container Strategy

### LXC vs Docker Decision Tree

| Aspect | LXC Containers | Docker-in-LXC | Full VM |
|--------|----------------|---------------|---------|
| Isolation | Full OS, systemd | Docker + OS | Complete |
| Backup | Proxmox snapshots | Proxmox + volumes | Proxmox snapshots |
| Use Case | Native services | Docker Compose apps | High I/O, special kernel |
| Examples | Pi-hole, Semaphore | Graylog, Caddy | Docker-Main, KASM |
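The decision tree diagram was lost in export, but the logic in the table reduces to a small function. The criteria names below are my paraphrase of the "Use Case" row, not an exact reproduction of the original tree:

```python
def placement(needs_special_kernel: bool = False,
              high_io: bool = False,
              ships_as_compose: bool = False) -> str:
    """Rough encoding of the table's Use Case row: kernel/I/O needs win,
    Compose-shipped apps get Docker-in-LXC, everything else is native LXC."""
    if needs_special_kernel or high_io:
        return "Full VM"
    if ships_as_compose:
        return "Docker-in-LXC"
    return "LXC"
```

For example, Graylog (ships as a Compose stack) lands in Docker-in-LXC, while Pi-hole (a native service) stays in a plain LXC.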

### Docker-in-LXC Pattern

For services that ship as Docker Compose stacks:

*Diagram: Docker-in-LXC pattern*

Why not Docker directly on Proxmox?

  • βœ… Proxmox backups capture entire LXC state
  • βœ… Network isolation via Proxmox VLANs
  • βœ… Resource limits enforced at LXC level
  • βœ… Easier migration between nodes
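Running Docker inside an LXC typically requires enabling the `nesting` (and usually `keyctl`) features on the container. A sketch that builds the corresponding `pct set` invocation (the CTID 106 in the test is a made-up example):

```python
def enable_docker_features(ctid: int) -> list[str]:
    """Build the `pct set` command that flips on the LXC features Docker
    needs. Returned as an argv list suitable for subprocess.run()."""
    return ["pct", "set", str(ctid), "--features", "nesting=1,keyctl=1"]
```

The same effect can be had by ticking "Nesting" and "keyctl" under Options in the Proxmox UI; the command form is just easier to automate.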

## High Availability Services

Three service categories run as HA pairs with automatic failover:

*Diagram: High availability pairs*

| Service | Primary Node | Secondary Node | Failover Tech | RTO |
|---------|--------------|----------------|---------------|-----|
| DNS | Node 3 | Node 5 | keepalived VRRP | ~15s |
| Caddy | Node 6 | Node 5 | keepalived VRRP | ~10s |
| NFS | Node 6 | Node 5 | keepalived + rsync | ~10s |

Node distribution strategy: Primary services split across nodes 3/6, secondaries on node 5. This ensures a single node failure doesn’t take down all primaries.

## Backup Strategy

### Proxmox Backup Server (PBS)

*Diagram: PBS backup pipeline*

| Setting | Value | Rationale |
|---------|-------|-----------|
| Storage | NFS from NAS | Offsite from compute nodes |
| Retention | 7 backups | Weekly rotation |
| Compression | ZSTD | Good compression/speed balance |
| Mode | Snapshot | Live backups, no downtime |
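The 7-backup retention behaves like PBS's `keep-last` option: after each nightly run, only the newest seven snapshots survive pruning. A sketch of that policy (dates in the test are arbitrary):

```python
from datetime import date, timedelta


def prune(snapshots: list[date], keep_last: int = 7) -> list[date]:
    """Return the snapshots a keep-last=N policy retains, newest first.
    Mirrors the '7 backups / weekly rotation' setting above."""
    return sorted(snapshots, reverse=True)[:keep_last]
```

With nightly backups, keep-last=7 always leaves exactly one week of restore points on disk.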

Schedules:

  • πŸ—‘οΈ Garbage Collection: Daily at 02:00
  • βœ… Verification: Weekly on Sunday at 03:00
  • πŸ’Ύ Backup Jobs: Staggered throughout night

### Application-Level Backups

Critical applications also have their own backup scripts:

*Diagram: Application-level backup pipeline*

## Provisioning Automation

New VMs/LXCs are provisioned via a Python automation tool:

*Diagram: Provisioning automation flow*

Automation steps:

  1. Query NetBox for next available IP in target VLAN
  2. Calculate VM ID using VLAN + IP scheme
  3. Create container with correct storage selection
  4. Configure SSH key access
  5. Set timezone (America/Los_Angeles) and locale
  6. Install base packages

Result: 30 seconds from request to production-ready container.

## Storage Selection Rules

| Node | Storage Pool | Type | When to Use |
|------|--------------|------|-------------|
| Node 2, 5 | ssd-data | LVM-thin | General workloads, fast I/O |
| Node 3, 6 | zdata | ZFS | Data integrity critical, databases |
| All | `local` (never use) | - | Reserved for Proxmox system |
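The node-to-pool mapping in the table is mechanical, so the provisioning tool can encode it directly. A sketch (function name is mine, not from the tool):

```python
def storage_pool(node: int) -> str:
    """Map a target node to its data pool, per the table above.
    Raises rather than falling back to 'local', which is reserved
    for the Proxmox system itself."""
    pools = {2: "ssd-data", 5: "ssd-data", 3: "zdata", 6: "zdata"}
    if node not in pools:
        raise ValueError(f"unknown node {node}; never default to 'local'")
    return pools[node]
```

Raising on an unknown node is deliberate: a silent fallback to `local` would violate the "never use local" rule in the last table row.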

## LXC Privilege Levels

| Type | Use Case | Security | Examples |
|------|----------|----------|----------|
| Unprivileged | Most containers | βœ… Recommended | Pi-hole, Semaphore, n8n |
| Privileged | Special requirements | ⚠️ Use sparingly | PBS, Docker hosts, NFS |

Privileged container requirements:

  • PBS: Raw device access for backups
  • Docker hosts: cgroup access
  • NFS servers: Kernel module access

## Resource Management

### Over-Provisioning Strategy

Proxmox allows RAM over-provisioning. My approach:

| Metric | Allocated | Physical | Ratio |
|--------|-----------|----------|-------|
| RAM | ~140 GB | 96 GB | ~1.5x |
| CPU | Variable | 16 cores | Dynamic |

Why it works: Containers rarely all peak simultaneously. Monitor with Pulse dashboard to catch issues early.
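The ~1.5x figure in the table is just allocated over physical (140 / 96 β‰ˆ 1.46). A tiny sketch that computes it, plus an alert check; the 1.5 threshold is my assumption, not a documented limit of this cluster:

```python
def overcommit_ratio(allocated_gb: float, physical_gb: float) -> float:
    """Cluster-wide RAM overcommit ratio, as shown in the table."""
    return round(allocated_gb / physical_gb, 2)


def needs_attention(ratio: float, limit: float = 1.5) -> bool:
    # The 1.5x alert threshold is an assumed example, not from the wiki.
    return ratio > limit
```

In practice this is the kind of check a Pulse dashboard alert would run, flagging the cluster before memory pressure turns into an outage.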

## Template Library

Golden image templates accelerate deployment:

| Template | Contents | Deploy Time |
|----------|----------|-------------|
| Base Debian | SSH keys, timezone, core packages | 30 seconds |
| Docker-ready | Base + Docker + Compose | 45 seconds |
| Python Dev | Base + pyenv + common libraries | 60 seconds |
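Deploying from a golden template is a single `pct clone`. A sketch that builds the command as an argv list (the template ID 9000 in the test is a made-up example; full clones rather than linked clones are my assumption for portability between nodes):

```python
def clone_cmd(template_id: int, new_id: int, hostname: str) -> list[str]:
    """Build the `pct clone` invocation that stamps out a new container
    from a golden template as a full (independent) clone."""
    return ["pct", "clone", str(template_id), str(new_id),
            "--hostname", hostname, "--full"]
```

Pair this with the deterministic ID scheme and a new container's clone command can be generated entirely from its VLAN and IP.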

## Lessons Learned

### 1. ID Scheme Prevents Chaos

Before the VLAN+IP naming convention, I had random VM IDs and constantly forgot which IP belonged to which container. The deterministic scheme eliminated this entirely.

### 2. Split HA Across Nodes

I initially had both DNS containers on adjacent nodes, and a single network issue took down both. Now primaries and secondaries are deliberately split.

### 3. ZFS for Databases

Early database containers ran on LVM-thin. After a corruption incident (a power loss mid-write), I moved all databases to ZFS for its checksumming.

### 4. Template Everything

Creating a new container manually took 30 minutes of package installation and configuration. Templates reduced this to under a minute.

### 5. Monitor Over-Provisioning

RAM over-provisioning works greatβ€”until it doesn’t. The Pulse dashboard caught a memory pressure event before it became an outage.
