← Back to work
Two-Node Multi-GPU Inference Cluster
A self-built two-node multi-GPU cluster serving local models with a privacy-first architecture, thermal-aware autonomy, and GPU-wedge diagnosis and recovery.
- 2 nodes
- multi-GPU cluster measured
- VRAM-bound
- KV-cache / context tuning measured
- auto-recover
- GPU-wedge diagnosis measured
Context
Serving capable local models reliably means living inside hard VRAM and thermal limits — and recovering gracefully when a card wedges. I built the cluster and the tooling to do exactly that.
What I built
- A two-node cluster — a Windows workstation (RTX 5090) paired with a Linux agent machine (RTX 3090) — serving local models over llama.cpp / Ollama.
- VRAM-bound tuning of KV-cache and context windows to fit capable context lengths inside a hard memory ceiling without wedging the fragile card.
- Thermal-aware autonomy that backs off before the abort line, plus
gpu_doctor.py— structured GPU diagnostics (nvidia-smi analysis, VRAM mapping, machine-consumable JSON output) for fast fault diagnosis. - A privacy-first network — SSH tunnels and scoped port routing, with the agent machine holding only read-only access to shared services.
Why it matters
It is the hardware substrate the rest of the stack runs on — and a hands-on lesson in reliability under real physical constraints.