Cloud GPU? More like GP-Poo due to incompetent thermal management!

Cloud GPUs Throttled to 25% Due to Poor Cooling

certainly a direct title

alternatively: wherein matt, once again, teaches people who refuse to hire him how to do their own jobs

Over the weekend I was training a new micro foundation model, as one does, and after refactoring some delays in my dataloader, I had reduced a training step from 1.7 seconds down to 700 ms. Great! Over twice as fast!

or so I thought.

I left it for an hour, came back, and suddenly the training steps were running at 1.2 to 1.4 seconds each. weird.

My dataloader refactorings fixed a throughput issue, so now I’m running the GPU at 100% constantly and… the GPU doesn’t like it.
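If you want your training loop to notice this kind of drift instead of finding out an hour later, here is a minimal sketch of a step-time watchdog. It is not code from my trainer: `StepTimer`, the window size, and the 1.5x slowdown factor are all made-up names and numbers for illustration.

```python
from collections import deque


class StepTimer:
    """Tracks recent step durations and flags drift from the best average seen.

    Hypothetical helper: the window and slowdown_factor defaults are guesses.
    """

    def __init__(self, window=50, slowdown_factor=1.5):
        self.recent = deque(maxlen=window)
        self.best_avg = None
        self.slowdown_factor = slowdown_factor

    def record(self, seconds):
        """Add one step duration; return True if steps now look slowed down."""
        self.recent.append(seconds)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough samples for a stable average yet
        avg = sum(self.recent) / len(self.recent)
        if self.best_avg is None or avg < self.best_avg:
            self.best_avg = avg  # remember the fastest rolling average
        return avg > self.best_avg * self.slowdown_factor
```

In a real loop you would time each step with `time.perf_counter()` and call `record()` with the difference. If your framework launches GPU work asynchronously, you probably need to synchronize before reading the clock, otherwise you are timing the launch and not the step.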

D1

A quick and dirty look through nvtop shows…

*what the hell?* yes, 100% usage good, but 510 MHz bad! 88°C temp bad! *VERY VERY BAD! BAD CLOUD GPU!*

the GPU is at 100% usage, but throttled between 450 MHz and 550 MHz (the GPU default speed is 1755 MHz), oh, and it’s also operating at half power because of the thermal limit.

I would expect this from rando internet pre-builds for Steve to make fun of for half an hour, but on a $1,500/month state-of-the-art GPU host? Maybe call some canadians who know how to cool things down?

D2

okay, but what if we start fresh? Can we see the damage happen in real time? yup.

*idle -> startup (MAX POWER!) -> immediately throttles power to 50% and frequency to 25% in less than 10 seconds*. oddly, it uses 50% of the power budget to deliver only 25% performance. Maybe the excess power is memory?
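To put rough numbers on that, here is a small sketch of the arithmetic. It leans on one big assumption that I have not verified: throughput scales linearly with core clock. Real workloads that are memory-bound will not follow this exactly, so treat the output as a ballpark. The function names are mine, not from any library.

```python
BASE_CLOCK_MHZ = 1755  # default GPU clock quoted above


def clock_fraction(observed_mhz, base_mhz=BASE_CLOCK_MHZ):
    """Observed clock as a fraction of the default clock."""
    return observed_mhz / base_mhz


def cost_multiplier(speed_fraction):
    """Same hourly price, slower work: cost per step scales as 1 / speed.

    ASSUMPTION: speed is proportional to clock, which is only approximate.
    """
    return 1.0 / speed_fraction


def relative_efficiency(speed_fraction, power_fraction):
    """Work per watt relative to an unthrottled GPU at full power."""
    return speed_fraction / power_fraction


if __name__ == "__main__":
    for mhz in (450, 550):  # the throttled range seen in nvtop
        f = clock_fraction(mhz)
        print(f"{mhz} MHz: {f:.0%} of base clock, "
              f"{cost_multiplier(f):.1f}x cost per step")
    print(f"25% speed at 50% power: "
          f"{relative_efficiency(0.25, 0.5):.0%} of normal work per watt")
```

Under that linear assumption, running at a quarter of the speed for the same hourly price means each step costs four times as much.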

Better Charts

okay, those nvtop charts are built-in and easy to run, but we can’t see much of a history.

so of course I just wrote an entire real-time NVIDIA GPU metrics collector (on the day off for my birthday too, thanks for asking) to record server GPU stats every 10 ms then push them into a standard VictoriaMetrics + Grafana monitoring system.

yes, we are recording stats every 10 ms — 100 times per second — to see how this cloud failure is giving me 25% performance for 100% cost.
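For the curious, here is the general shape of a fixed-cadence collector. This is not my actual collector, just a sketch: `collect`, `to_line_protocol`, and `fake_sample` are hypothetical names, the sampler returns canned values where a real one would query the driver, and I am assuming the samples get shipped as InfluxDB-style line protocol.

```python
import time


def to_line_protocol(measurement, tags, fields, ts_ns):
    """Format one sample as an InfluxDB-style line-protocol line."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"


def collect(sample_fn, n_samples, interval_s=0.010):
    """Call sample_fn at a fixed cadence and return formatted lines.

    Scheduling against an absolute next_tick keeps the cadence from
    drifting when an individual sample takes longer than usual.
    """
    lines = []
    next_tick = time.monotonic()
    for _ in range(n_samples):
        fields = sample_fn()
        lines.append(to_line_protocol("gpu", {"index": "0"}, fields, time.time_ns()))
        next_tick += interval_s
        delay = next_tick - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return lines


def fake_sample():
    # Stand-in values only; a real sampler would read these from the driver.
    return {"clock_mhz": 510, "temp_c": 88, "power_w": 175.0}


if __name__ == "__main__":
    for line in collect(fake_sample, n_samples=5):
        print(line)
```

A real version would batch the lines and POST them to the metrics store. VictoriaMetrics documents an Influx-compatible ingestion path, but check its docs for the exact endpoint. Also expect some jitter at 10 ms in Python, since `time.sleep` is not a hard real-time timer.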

A Clean Start (6-minute chart)

Let’s look at a clean start of the training run. The charts below compare the frequency, power, and reported GPU temperatures when starting a training session from an idle GPU at 67°C.

Also note: the yellow horizontal line is the GPU driver reporting active thermal throttling due to insufficient cooling.
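If you want to check that throttle flag on your own machine, the driver exposes it through NVML as a bitmask of "clocks throttle reasons". Below is a sketch of decoding it. The bit values are the ones defined in NVIDIA's `nvml.h`. The short names, `decode_throttle_reasons`, `is_thermally_throttled`, and `read_mask` are my own hypothetical helpers, and `read_mask` assumes you have the `pynvml` bindings and an NVIDIA driver installed.

```python
# Bit values from NVIDIA's nvml.h (nvmlClocksThrottleReason* constants).
# The short names on the right are my own labels, not NVIDIA's.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",  # can also be triggered by temperature
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}
THERMAL_BITS = 0x020 | 0x040


def decode_throttle_reasons(mask):
    """Return the names of every throttle reason set in the bitmask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def is_thermally_throttled(mask):
    """True if either explicit thermal slowdown bit is set."""
    return bool(mask & THERMAL_BITS)


def read_mask(gpu_index=0):
    """Read the live bitmask via pynvml (needs the driver and pynvml)."""
    import pynvml  # imported lazily so the decoder works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        return pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
    finally:
        pynvml.nvmlShutdown()
```

On a GPU box, `decode_throttle_reasons(read_mask())` gives the list of active reasons. Sampling `is_thermally_throttled` over time is one way to draw a line like the yellow one in these charts.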

Frequency

From the start of training, we can see the frequency starts at the “good baseline” 1755 MHz, then gets unstable, then just gives up at 345 MHz, all in less than 1 minute.

Also notice the second top horizontal green line — my current training loop pauses to pre-generate next batches of data, so while the trainer is playing around in the dataloader, the GPU gets a rest (back to full frequency capability, but basically idle), and when the trainer returns to churning steps of batches, the H100 immediately goes to throttle town again.

Look at the yellow line. Essentially, my entire training timeline is thermally throttled to between 25% and 50% of peak speed.

Temps

Launch!

it would be pretty if it wasn’t raising my training cost by 300%

HI! I’m ANN REARDON AND TODAY WE’RE GONNA COOK A $40,000 GPU ON THE BARBIE!

Power

NOT unlimited power.

unlimited HEAT leads to VERY limited power.

The 350 watt part can’t maintain 350 watts because of thermal issues, so it bounces between 150 W and 250 W all day long.

Temps + Frequency

If you couldn’t see it before… now it’s obvious: temps go up and speeds go down.

The line in yellow still marks the performance death boundary enforced by drivers.