Tail Latency Killed My Beowulf Cluster in 2006 — It's Killing Your GPU Fleet Today

Right now, I’m working on an InfiniBand topology design for a GPU cluster. The math keeps pointing to the same conclusion: scale-out only makes sense when scale-in has topped out. It’s not about CUDA cores. It’s not about tensor throughput. It’s about tail latency. NVLink keeps GPU-to-GPU communication on-package or over short copper links — no NIC, no PCIe host traversal, no protocol stack. For small messages, that means latency in the hundreds of nanoseconds. InfiniBand NDR switches add sub-microsecond port-to-port latency, but once you include the full path — PCIe to the NIC, driver overhead, fabric hops, and back — real-world GPU-to-GPU latency across nodes often lands in the 3–10 μs range, depending on message size and topology. ...
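
A back-of-the-envelope sketch of that budget, in Python. Every component number here is an assumed, illustrative figure rather than a measurement from the post, but the arithmetic shows the point: each hop in the cross-node path is individually sub-microsecond, and it's the sum you actually feel.

```python
# Back-of-the-envelope latency budget for one small cross-node GPU-to-GPU message.
# All values are assumed, illustrative figures -- real numbers depend on the NIC,
# PCIe generation, fabric topology, and message size.

nvlink_hop_ns = 300  # assumed NVLink hop: on-package / short copper, no NIC, no host PCIe

cross_node_ns = {
    "gpu_to_nic_over_pcie": 800,     # assumed PCIe traversal to the local NIC
    "nic_and_driver_overhead": 700,  # assumed NIC processing + driver/protocol stack
    "fabric_hops": 3 * 300,          # assumed 3 switch hops, each sub-microsecond port-to-port
    "nic_to_gpu_remote_side": 800,   # assumed PCIe traversal on the far node
}

total_us = sum(cross_node_ns.values()) / 1_000
print(f"NVLink hop:       {nvlink_hop_ns / 1_000:.1f} us")  # ~0.3 us
print(f"Cross-node total: {total_us:.1f} us")               # ~3.2 us, inside the 3-10 us band
```

Nudge any of those assumed components upward and the total walks toward the top of that 3–10 μs range, which is the whole argument for scaling in before scaling out.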

January 4, 2026 · 7 min

Telemetry That Lies: Why GPU Thermal Monitoring Is Harder Than It Looks

The “Everything Is Green” Problem Here’s a realistic scenario I’ve seen in different forms across fleets (a composite, not a single incident with exact numbers): A training run is supposed to take ~3–4 weeks. Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun. The dashboards look perfect: ...

December 27, 2025 · 7 min