# Day 2 — Operate

Day 2 is everything that happens after the initial deployment — keeping the enclave healthy, backed up, and diagnosable.

## Operations Areas

| Topic | Document |
| --- | --- |
| Dashboards, metrics, health monitoring | Monitoring |
| Backups, config management, Ansible | Backup & Maintenance |
| Common issues, `kubectl` snippets, cert debugging | Troubleshooting |

## Operational Principles

### Know Your Blast Radius

The enclave is a 3-node Harvester cluster using etcd for distributed state. Understand the failure modes:

| Failure | Impact | Recovery |
| --- | --- | --- |
| 1 Harvester node down | Cluster continues (2/3 quorum) | Bring the node back up; auto-recovers |
| 2+ Harvester nodes down | Cluster halted (no quorum) | Manual etcd recovery required |
| `infra-01` (DHCP/DNS) down | New DHCP leases fail; DNS fails | Restart the VM; existing connections survive |
| `infra-02` (HAProxy/Keepalived) down | VIPs go offline | Restart the VM or promote the backup |
| `nuc-00` down | infra VMs offline; PXE unavailable | Bring `nuc-00` back up |
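The quorum arithmetic behind the table above can be sketched as a small shell helper (the function names are illustrative; only the 3-node figure comes from this cluster):

```shell
#!/usr/bin/env sh
# etcd stays writable only while a strict majority of members is healthy.
quorum() {          # minimum healthy members required, for cluster size $1
  echo $(( $1 / 2 + 1 ))
}
tolerated() {       # simultaneous node failures survivable, for cluster size $1
  echo $(( $1 - ($1 / 2 + 1) ))
}

echo "3-node quorum:    $(quorum 3)"     # prints 2 — two members must be up
echo "3-node tolerance: $(tolerated 3)"  # prints 1 — losing one node is safe
```

This is why one node down is a non-event but two nodes down halts the cluster: 3 - 2 = 1 is the entire failure budget.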

### Change Management

Treat the enclave like production:

- Test changes on a single node before rolling them out to all three
- Keep the Ansible playbooks in version control; commit before and after changes
- Document any manual changes immediately — the next person (future you) needs to know
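The "one node first" rule can also be encoded in the playbook itself rather than remembered. A minimal sketch, assuming a `harvester` inventory group (the group and file names here are illustrative, not taken from the actual playbooks):

```yaml
# site.yml (sketch) — roll a change across the cluster one node at a time
- hosts: harvester
  serial: 1                # finish each host completely before starting the next
  max_fail_percentage: 0   # abort the whole run on the first failed host
  tasks:
    - name: Apply the change under test
      ansible.builtin.debug:
        msg: "placeholder for the real change"
```

For a true single-node trial before any rollout, `ansible-playbook site.yml --limit nuc-01` restricts the run to one host.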

### Patching Cadence

| Component | Cadence | Method |
| --- | --- | --- |
| openSUSE Leap 15.5 (`nuc-00`, VMs) | Monthly | `zypper update -y` via Ansible |
| Harvester | Per release | Harvester UI upgrade wizard |
| Rancher Manager | Per release | `helm upgrade` |
| K3s (on `rancher-mgr`) | With Rancher | Auto or manual |
| cert-manager | Quarterly | `helm upgrade` |
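The monthly Leap patch run can be driven as an Ansible ad-hoc command rather than a full playbook. A sketch, assuming an inventory group named `leap_hosts` (the group and inventory filename are assumptions, not from the real inventory):

```shell
# Refresh repos and apply all pending updates on every Leap host, serially.
# --forks 1 patches one host at a time; --become escalates to root.
ansible leap_hosts -i inventory.ini --become --forks 1 \
  -m community.general.zypper \
  -a "name='*' state=latest update_cache=true"
```

Using the `community.general.zypper` module instead of a raw `zypper update -y` shell task makes the run idempotent and gives per-host changed/failed reporting.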

## Day 2 Runbooks

Quick reference for common tasks:

```shell
# Check overall cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed

# Check Rancher pods
kubectl --kubeconfig ~/.kube/rancher-k3s-config get pods -n cattle-system

# Restart a stuck VM in Harvester
virtctl restart <vm-name> -n <namespace>

# Force-drain a Harvester node for maintenance
kubectl drain nuc-02 --ignore-daemonsets --delete-emptydir-data
# After maintenance:
kubectl uncordon nuc-02
```
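After uncordoning, it is worth confirming the node actually rejoined before closing the maintenance window. These are standard `kubectl` invocations (run against the Harvester cluster's kubeconfig):

```shell
# Block until the node reports Ready, or fail after 5 minutes
kubectl wait --for=condition=Ready node/nuc-02 --timeout=5m

# Confirm workloads have rescheduled back onto the node
kubectl get pods -A --field-selector spec.nodeName=nuc-02
```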