Telemetry That Lies: Why GPU Thermal Monitoring Is Harder Than It Looks

The “Everything Is Green” Problem

Here’s a realistic scenario I’ve seen in different forms across fleets (this is a composite, not a single true story with exact numbers): A training run is supposed to take ~3–4 weeks. Two weeks in, someone notices the timeline slipping. Not a crash. Not a failure. Just… slow. The job is running 10–30% behind plan, and nobody can point to a smoking gun. The dashboards look perfect: ...

December 27, 2025 · 7 min

Predictive Power Conditioning for GPU Clusters

GPU clusters don’t fail from sustained load. They fail on transitions. A pod idling at 20 kW can step toward 300 kW quickly when training begins. The peak matters, but the killer is the step: the dP/dt that forces every layer of the electrical path to react at once. Thermals matter too, but they are a secondary, collateral effect. Power transients can drive protection and control systems into repeated cycling. Thermal consequences show up later as throttling, efficiency loss, and “mysteriously slower training” that looks like a software problem until you instrument the facility. ...

December 18, 2025 · 5 min

HVAC Doesn't Create Cold — It Removes Heat

This is the first of a series of URE articles about thermal management in data center environments: not theory, not “best practices,” but what actually happens when heat meets physics and scale.

Here’s a simple puzzle from two idle machines.

ai01 (home lab): Threadripper 32-core with 2× NVIDIA GPUs (NVLink), rack-level liquid cooling loop, used for ML training and vLLM inference:

Tctl: +33.0°C
Tccd1: +33.2°C
Tccd5: +31.5°C

nj01 (third-party datacenter, colo): Ryzen 12-core, air-cooled: ...

December 7, 2025 · 4 min