Late Night V100 Discoveries

25 May 2026, late session

After the main V100 setup marathon, we spent the evening diving deep into performance tuning and made several surprising discoveries.

MTP: The Beautiful Lie

Multi-Token Prediction (MTP) looked amazing on paper: 46 t/s with Q8_0 on 3× V100. But it turned out to be a single-user luxury:

Crashes at 2 concurrent — the RS (recurrent state) buffer resize OOMs every time
Coil whine — the rapid draft→verify→accept cycle creates audible buzzing from the adapter boards
Even Q4_K_M with 5-6GB free per GPU crashes — MTP RS allocation is greedy

We tested every MTP combination: layer split, row split, turbo3 cache, q8_0 cache, f16 cache, Q8 model, Q4 model. All crash at 2 concurrent on 3× 16GB. A 4th V100 would NOT fix this — MTP is single-user by design.

Row vs Layer: PCIe x4 Decides

I discovered that llama.cpp supports --split-mode row (tensor parallelism); for months I'd assumed it was pipeline-only. Row mode distributes VRAM more evenly across GPUs, which is how we fit MTP + q8_0 cache together.

But on this hardware, row mode is slower (17 t/s vs 23 t/s with layer split) because tensor parallelism needs fast inter-GPU sync, and two of our V100s sit on chipset PCIe x4 (4 GB/s). Pipeline parallelism only transfers small activations between layers, so it generates much less sync traffic.

Row mode would shine with NVLink (300 GB/s). On PCIe x4, it's a liability.
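To put that bandwidth gap in perspective, here is a rough sketch of how long one sync payload would take at each link speed. The 4 GB/s and 300 GB/s figures are the ones quoted above; the 1 MB payload is a made-up illustrative size, not something measured from llama.cpp, and the model ignores latency and protocol overhead entirely.

```python
# Back-of-envelope: ideal time to move one sync payload over each link.
# Bandwidths are the figures quoted in the text; the payload size is a
# hypothetical example value, not a measurement from llama.cpp.

PCIE_X4_GBPS = 4.0    # chipset PCIe x4, GB/s
NVLINK_GBPS = 300.0   # NVLink, GB/s


def transfer_time_us(payload_bytes, gb_per_s):
    """Ideal transfer time in microseconds (no latency, no overhead)."""
    return payload_bytes / (gb_per_s * 1e9) * 1e6


payload = 1_000_000  # hypothetical 1 MB payload
pcie_us = transfer_time_us(payload, PCIE_X4_GBPS)
nvlink_us = transfer_time_us(payload, NVLINK_GBPS)

print(f"PCIe x4: {pcie_us:.1f} us")
print(f"NVLink:  {nvlink_us:.1f} us")
print(f"ratio:   {pcie_us / nvlink_us:.0f}x")
```

Whatever the real payload size is, the ratio between the two links stays the same in this idealised model, which is why row mode's extra sync traffic hurts so much more on the x4 slots.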

The Passive Cooling Surprise

Three V100 SXM2 cards with massive server heatsinks running at 50-60W (out of 300W TDP) barely get warm. GPU 1 (sandwiched in the middle) reached 83°C during sustained 6-concurrent benchmarking without a fan, but at normal single-user loads it sits at 55°C.

We briefly tried setting 150W power limits (nvidia-smi -pl 150) so that each adapter could run from a single 8-pin connector, but this caused crashes during model loading. The initial model load needs burst power above 150W even though steady-state inference uses only about 50W.
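A minimal sketch of how temperature and power could be watched during a run like this, assuming nvidia-smi's CSV query output (index, temperature.gpu, power.draw). The 80°C alert threshold and the sample readings are hypothetical values chosen for illustration, not logged data, and read_live() is only defined here, not called.

```python
import subprocess

# nvidia-smi query that prints one "index, temp, power" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_readings(csv_text):
    """Turn 'index, temp, power' CSV lines into a list of dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, temp_c, power_w = (field.strip() for field in line.split(","))
        readings.append(
            {"gpu": int(index), "temp_c": int(temp_c), "power_w": float(power_w)}
        )
    return readings


def hot_gpus(readings, limit_c=80):
    """Return indices of GPUs at or above limit_c (threshold is an assumption)."""
    return [r["gpu"] for r in readings if r["temp_c"] >= limit_c]


def read_live():
    """Query the real GPUs; requires nvidia-smi on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_readings(out)


# Hypothetical sample output, for illustration only.
sample = "0, 61, 55.20\n1, 83, 58.75\n2, 64, 54.10\n"
print(hot_gpus(parse_readings(sample)))
```

Keeping the parser separate from the subprocess call means the logic can be checked against canned text on a machine without GPUs.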

The Winner: Q8 Quality Mode

After testing 11 service profiles, the winner is simple:

Q8_0 model — highest quality 27B quantization
q8_0 KV cache — best cache quality that fits
No MTP — stable, silent, multi-user safe
Layer split — faster than row on PCIe x4
262K context — full training context
55 t/s total at 3-4 concurrent — excellent throughput

The benchmark result surprised us: total throughput rose from 35 t/s in earlier runs to 55 t/s at 3-4 concurrent requests, most likely because we had stopped the CPU llama-server services that were competing for system resources.
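For a feel of what those totals mean per user, here is the arithmetic as a small sketch. It assumes the total is shared equally between streams, which is a simplification we did not verify; the 55 and 35 t/s figures are the ones reported above.

```python
# Per-stream share and relative gain from the totals quoted in the text.
# Assumes throughput divides evenly across streams (a simplification).

TOTAL_TPS = 55.0    # total t/s at 3-4 concurrent
EARLIER_TPS = 35.0  # total t/s in earlier runs


def per_stream(total_tps, streams):
    """Average tokens per second each stream would see."""
    return total_tps / streams


for n in (3, 4):
    print(f"{n} streams: {per_stream(TOTAL_TPS, n):.1f} t/s each")

gain_pct = (TOTAL_TPS - EARLIER_TPS) / EARLIER_TPS * 100
print(f"gain over earlier runs: {gain_pct:.0f}%")
```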

Coil Whine Detective Work

Fun debugging session: inference made weird buzzing sounds. We isolated it by stress-testing each GPU individually (no sound), then running multi-GPU inference (sound only with MTP). The speculative decode cycle's rapid power oscillations vibrate the inductors on the SXM2 adapter boards. No MTP = smooth power = silence.

Boot Chain Fixed

We also fixed the nouveau race condition permanently by adding nvidia to the initramfs MODULES list. Before this fix, every reboot required a manual rmmod nouveau && modprobe nvidia. Now nvidia wins the module loading race at the earliest boot stage.
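A quick way to confirm after a reboot which driver won the race is to look at /proc/modules, where each line starts with a module name. The sketch below is one hedged way to do that; the status strings and the sample excerpt are hypothetical, and read_live() is only defined here, not called.

```python
# Sketch: report which GPU kernel driver is loaded, based on /proc/modules.
# Each line of /proc/modules begins with the module name.


def loaded_modules(proc_modules_text):
    """Return the set of module names in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def driver_status(modules):
    """Classify the boot outcome from the set of loaded module names."""
    if "nouveau" in modules:
        return "nouveau loaded (race lost)"
    if "nvidia" in modules:
        return "nvidia loaded"
    return "no GPU driver loaded"


def read_live(path="/proc/modules"):
    """Check the running system (Linux only)."""
    with open(path) as handle:
        return driver_status(loaded_modules(handle.read()))


# Hypothetical excerpt, for illustration only.
sample = (
    "nvidia_uvm 1540096 0 - Live 0x0000000000000000\n"
    "nvidia 56823808 2 nvidia_uvm, Live 0x0000000000000000\n"
)
print(driver_status(loaded_modules(sample)))
```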

The Splitter Verdict

We tested bifurcation one more time with an x8+x8 split and two V100s plugged into the splitter. The second slot was still dead (link width x0). That confirms the splitter only routes lanes 0-7. It's hardware junk, not a BIOS issue. Direct PCIe slots are the only way.
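The negotiated link width shows up on the LnkSta lines of lspci -vv output, so a dead slot can be spotted by scanning for a width of x0. Here is a small sketch of that check; the sample text is a hypothetical excerpt rather than a capture from this machine, and the exact line layout can differ between lspci versions.

```python
import re

# Matches the negotiated width on an "LnkSta:" line of lspci -vv output.
# "." does not cross newlines, so each match stays within one line.
LNKSTA_WIDTH = re.compile(r"LnkSta:.*?Width x(\d+)")


def link_widths(lspci_text):
    """Return the negotiated lane count from each LnkSta line, in order."""
    return [int(m.group(1)) for m in LNKSTA_WIDTH.finditer(lspci_text)]


def dead_links(widths):
    """Return positions of links that negotiated zero lanes."""
    return [i for i, width in enumerate(widths) if width == 0]


# Hypothetical excerpt, for illustration only.
sample = (
    "\t\tLnkSta:\tSpeed 8GT/s, Width x8\n"
    "\t\tLnkSta:\tSpeed 2.5GT/s, Width x0\n"
)
widths = link_widths(sample)
print(widths, dead_links(widths))
```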

Tomorrow

This machine goes to the office with SUEZ WiFi pre-configured and frp disabled. It boots automatically with the nvidia driver and the Q8 Quality llama-server, so no intervention is needed.

Three V100s, ¥2,160, 48GB VRAM, 55 t/s throughput. Not bad for a Sunday evening project.