Troubleshooting
Quick Diagnostics Checklist
When something isn't working, run through these in order:
# 1. Are all Harvester nodes up?
kubectl get nodes
# 2. Are all pods running?
kubectl get pods -A | grep -Ev "Running|Completed"
# 3. Is the Harvester VIP responding?
curl -k https://10.10.12.100/ping # Harvester
# 4. Is the Rancher VIP responding?
curl -k https://10.10.12.210/ping # Rancher
# 5. Is DNS working?
dig @10.10.12.8 rancher.enclave.kubernerdes.com
# 6. Is HAProxy healthy?
curl http://10.10.12.93:9000/stats | grep -E "UP|DOWN"
Network Issues
VIP Not Reachable
Symptom: curl https://10.10.12.210 times out.
# Check if VIP is assigned on nuc-00-03
ssh mansible@10.10.12.93 "ip addr show | grep 10.10.12.210"
# If not assigned, Keepalived may have stopped
ssh mansible@10.10.12.93 "systemctl status keepalived"
ssh mansible@10.10.12.93 "journalctl -u keepalived --since '10 min ago'"
# Restart Keepalived
ssh mansible@10.10.12.93 "systemctl restart keepalived"
If Keepalived is running but VIP is still missing, check for IP conflicts:
# ARP scan for the VIP address
arping -I eth0 10.10.12.210
HAProxy Backend Down
Symptom: HAProxy stats show backend DOWN.
# Check HAProxy logs
ssh mansible@10.10.12.93 "journalctl -u haproxy --since '30 min ago'"
# Manually test backend connectivity from nuc-00-03
ssh mansible@10.10.12.93 "curl -k https://10.10.12.211:443/ping"
If a Rancher node backend is down, check if the node is up:
ping 10.10.12.211
ssh mansible@10.10.12.211 "systemctl status k3s"
DNS Not Resolving
# Test DNS directly against nuc-00-01
dig @10.10.12.8 nuc-01.enclave.kubernerdes.com
# Check BIND logs
ssh mansible@10.10.12.8 "journalctl -u named --since '10 min ago'"
# Reload zone after editing zone file
ssh mansible@10.10.12.8 "named-checkzone enclave.kubernerdes.com /var/lib/named/master/db.enclave.kubernerdes.com && rndc reload"
# Verify secondary DNS has synced
dig @10.10.12.9 nuc-01.enclave.kubernerdes.com
Harvester Cluster Issues
Node NotReady
# Describe the node
kubectl describe node nuc-02
# Check k3s on the affected node
ssh rancher@10.10.12.102 "systemctl status k3s"
ssh rancher@10.10.12.102 "journalctl -u k3s --since '15 min ago' | tail -50"
# Common fix: restart k3s
ssh rancher@10.10.12.102 "systemctl restart k3s"
Longhorn Volume Degraded
# Find degraded volumes
kubectl get volumes -n longhorn-system \
-o custom-columns='NAME:.metadata.name,ROBUSTNESS:.status.robustness' | \
grep -v healthy
# Describe a degraded volume
kubectl describe volume <volume-name> -n longhorn-system
# Check replica status
kubectl get replicas -n longhorn-system | grep <volume-name>
Common causes:
- A node was recently rebooted → replicas are still rebuilding (wait 10–20 min)
- A node is down → volume is degraded until node returns
- Disk full → free space or expand disk
etcd Cluster Unhealthy
# Check etcd members
ssh rancher@10.10.12.101
kubectl get pods -n kube-system | grep etcd
# Check etcd logs
kubectl logs -n kube-system etcd-nuc-01 --tail=50
# etcd cluster health (from inside a node)
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
--cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
--key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
member list
Rancher Manager Issues
Rancher Pods CrashLooping
# Check pod status
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
get pods -n cattle-system
# Get logs from crashed pod
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
logs -n cattle-system deploy/rancher --previous
# Describe pod for events
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
describe pod -n cattle-system -l app=rancher
Common causes:
- cert-manager certificate not issued → see Certificate Issues
- K3s node out of memory → check
kubectl top nodes - Rancher DB corrupted → restore from etcd snapshot
K3s Node Not Joining Rancher Cluster
# Check K3s status on the failing node
ssh mansible@10.10.12.212 "systemctl status k3s"
ssh mansible@10.10.12.212 "journalctl -u k3s --since '15 min ago' | tail -50"
# Verify the VIP is reachable from that node
ssh mansible@10.10.12.212 "ping -c 3 10.10.12.210"
ssh mansible@10.10.12.212 "curl -k https://10.10.12.211:6443"
Rancher Cannot Connect to Harvester
In Rancher UI, the imported Harvester cluster shows "Unavailable":
# Check cattle-cluster-agent on Harvester side
kubectl get pods -n cattle-system
kubectl logs -n cattle-system -l app=cattle-cluster-agent
# Verify Rancher URL is reachable from Harvester nodes
kubectl exec -n cattle-system deploy/cattle-cluster-agent -- \
curl -k https://10.10.12.210/ping
Certificate Issues
cert-manager Certificate Not Issued
# Check certificate status
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
describe certificate -n cattle-system tls-rancher-ingress
# Check CertificateRequest
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
get certificaterequests -n cattle-system
# Check cert-manager logs
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
logs -n cert-manager deploy/cert-manager | tail -30
For Let's Encrypt certificates, verify:
- DNS resolves
rancher.enclave.kubernerdes.comto a publicly routable IP (not10.10.12.210) - Port 80 is open for ACME HTTP-01 challenge
For air-gapped deployments with self-signed certs, use ingress.tls.source=secret:
# Create a self-signed cert
openssl req -x509 -newkey rsa:4096 -keyout tls.key -out tls.crt \
-days 365 -nodes -subj "/CN=rancher.enclave.kubernerdes.com" \
-addext "subjectAltName=DNS:rancher.enclave.kubernerdes.com,IP:10.10.12.210"
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
create secret tls tls-rancher-ingress \
--cert=tls.crt --key=tls.key \
-n cattle-system
Certificate Expired
# Check expiry
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
get certificates -A -o \
custom-columns='NAME:.metadata.name,EXPIRY:.status.notAfter,READY:.status.conditions[-1].status'
# Force renewal (cert-manager)
kubectl --kubeconfig ~/.kube/rancher-k3s-config \
annotate certificate tls-rancher-ingress \
-n cattle-system \
cert-manager.io/issue-temporary-certificate="true"
PXE Boot Issues
Harvester Node Fails to PXE Boot
# Check DHCP leases on nuc-00-01
ssh mansible@10.10.12.8 "cat /var/lib/dhcpd/dhcpd.leases | grep -A5 <MAC>"
# Verify TFTP is serving
tftp 10.10.12.10 -c get pxelinux.0
# Check Apache is serving Harvester files
curl http://10.10.12.10/harvester/
# Check firewall on nuc-00
firewall-cmd --list-all | grep -E "tftp|http"
Common fixes:
- Verify the NUC's MAC address matches the DHCP static lease in
/etc/dhcp/dhcpd.conf - Confirm BIOS boot order: Network first, Secure Boot disabled
- Check
next-serverin dhcpd.conf points to10.10.12.10
Useful kubectl Snippets
# All pods not Running
kubectl get pods -A --field-selector='status.phase!=Running' | grep -v Completed
# Events sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
# Exec into a pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
# Port-forward a service for local access
kubectl port-forward svc/longhorn-frontend 8080:80 -n longhorn-system
# Get all resources in a namespace
kubectl get all -n cattle-system
# Force-delete a stuck pod
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
# Watch pods in real time
watch kubectl get pods -n cattle-system
# Drain and uncordon a node
kubectl drain nuc-02 --ignore-daemonsets --delete-emptydir-data --timeout=5m
kubectl uncordon nuc-02