Late Night V100 Discoveries
25 May 2026, late session
After the main V100 setup marathon, we spent the evening diving deep into performance tuning and made several surprising discoveries.
MTP: The Beautiful Lie
Multi-Token Prediction looked amazing on paper — 46 t/s with Q8_0 on 3x V100. But it turned out to be a single-user luxury:
- Crashes at 2 concurrent — the RS (recurrent state) buffer resize OOMs every time
- Coil whine — the rapid draft→verify→accept cycle creates audible buzzing from the adapter boards
- Even Q4_K_M with 5-6GB free per GPU crashes — MTP RS allocation is greedy
We tested every MTP combination: layer split, row split, turbo3 cache, q8_0 cache, f16 cache, Q8 model, Q4 model. All crash at 2 concurrent on 3× 16GB. A 4th V100 would NOT fix this — MTP is single-user by design.
Row vs Layer: PCIe x4 Decides
Discovered llama.cpp supports --split-mode row (tensor parallelism) — I'd assumed pipeline-only for months. Row mode distributes VRAM more evenly across GPUs, which is how we fit MTP + q8_0 cache together.
But on this hardware, row mode is slower (17 t/s vs 23 t/s) because tensor parallelism needs fast inter-GPU sync, and two of our V100s sit on chipset PCIe x4 (4 GB/s). Pipeline parallelism only transfers small activations between layers — much less sync traffic.
Row mode would shine with NVLink (300 GB/s). On PCIe x4, it's a liability.
The Passive Cooling Surprise
Three V100 SXM2 cards with massive server heatsinks running at 50-60W (out of 300W TDP) barely get warm. GPU 1 (sandwiched in the middle) reached 83°C during sustained 6-concurrent benchmarking without a fan, but at normal single-user loads it sits at 55°C.
We briefly tried setting 150W power limits (nvidia-smi -pl 150) to allow single 8-pin per adapter, but it caused loading crashes. The initial model load needs burst power above 150W even though steady-state inference uses 50W.
The Winner: Q8 Quality Mode
After testing 11 service profiles, the winner is simple:
- Q8_0 model — highest quality 27B quantization
- q8_0 KV cache — best cache quality that fits
- No MTP — stable, silent, multi-user safe
- Layer split — faster than row on PCIe x4
- 262K context — full training context
- 55 t/s total at 3-4 concurrent — excellent throughput
The benchmark surprised us — 55 t/s total at 3-4 concurrent, up from 35 t/s in earlier runs. Likely because we stopped the CPU llama-server services that were competing for system resources.
Coil Whine Detective Work
Fun debugging session: inference made weird buzzing sounds. We isolated it by stress-testing each GPU individually (no sound), then running multi-GPU inference (sound only with MTP). The speculative decode cycle's rapid power oscillations vibrate the inductors on the SXM2 adapter boards. No MTP = smooth power = silence.
Boot Chain Fixed
Also fixed the nouveau race condition permanently by adding nvidia to initramfs MODULES. Before this fix, every reboot required manual rmmod nouveau && modprobe nvidia. Now nvidia wins the module loading race at the earliest boot stage.
The Splitter Verdict
Tested bifurcation one more time — x8+x8 split, plugged two V100s into the splitter. Second slot still dead (Width x0). Confirmed: the splitter only routes lanes 0-7. It's hardware junk, not a BIOS issue. Direct PCIe slots are the only way.
Tomorrow
This machine goes to the office with SUEZ WiFi pre-configured and frp disabled. Boots up automatically with nvidia driver + Q8 Quality llama-server. No intervention needed.
Three V100s, ¥2,160, 48GB VRAM, 55 t/s throughput. Not bad for a Sunday evening project.