Late Night V100 Discoveries

25 May 2026, late session

After the main V100 setup marathon, we spent the evening diving deep into performance tuning and made several surprising discoveries.

MTP: The Beautiful Lie

Multi-Token Prediction (MTP) looked amazing on paper, hitting 46 t/s with Q8_0 on 3× V100. But it turned out to be a single-user luxury:

We tested every MTP combination: layer split, row split, turbo3 cache, q8_0 cache, f16 cache, Q8 model, Q4 model. All of them crashed at 2 concurrent requests on 3× 16GB. A 4th V100 would NOT fix this: MTP is single-user by design.

Row vs Layer: PCIe x4 Decides

Discovered llama.cpp supports --split-mode row (tensor parallelism) — I'd assumed pipeline-only for months. Row mode distributes VRAM more evenly across GPUs, which is how we fit MTP + q8_0 cache together.
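A minimal sketch of how the two split modes could be A/B tested, assuming a stock llama-server binary on the PATH. The flags used (-m, -ngl, --split-mode, --parallel) are standard llama.cpp options; the model path and the layer count of 99 are placeholders, not the values from our service profiles.

```python
import shlex

# Split modes accepted by llama.cpp's --split-mode flag.
VALID_SPLIT_MODES = ("none", "layer", "row")


def llama_server_cmd(model_path: str, split_mode: str, parallel: int = 1) -> list[str]:
    """Build a llama-server argv for the given split mode."""
    if split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"unknown split mode: {split_mode!r}")
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                # placeholder: offload all layers to GPU
        "--split-mode", split_mode,  # "layer" = pipeline, "row" = tensor split
        "--parallel", str(parallel),
    ]


if __name__ == "__main__":
    model = "/models/example-q8_0.gguf"  # hypothetical path
    for mode in ("layer", "row"):
        print(shlex.join(llama_server_cmd(model, mode)))
```

Running the same prompt against each command line and comparing tokens per second is enough to see which mode suits a given PCIe layout.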

But on this hardware, row mode is slower (17 t/s vs 23 t/s for layer mode) because tensor parallelism needs fast inter-GPU sync, and two of our V100s sit on chipset PCIe x4 (about 4 GB/s). Pipeline parallelism only transfers small activations between layers, so it generates much less sync traffic.

Row mode would shine with NVLink (300 GB/s). On PCIe x4, it's a liability.
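A rough back-of-envelope sketch of why the link matters. The 4 GB/s and 300 GB/s figures are the ones quoted above; the 1 MB payload per sync point is an invented illustrative value, not something measured from llama.cpp, and the model ignores latency and protocol overhead entirely.

```python
# Ideal time to move one sync payload over each link.
PCIE_X4_GB_PER_S = 4.0    # chipset PCIe x4, figure from this post
NVLINK_GB_PER_S = 300.0   # NVLink, figure from this post


def transfer_time_us(payload_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in microseconds (no latency, no overhead)."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


if __name__ == "__main__":
    payload = 1_000_000  # hypothetical: 1 MB exchanged per sync point
    pcie = transfer_time_us(payload, PCIE_X4_GB_PER_S)
    nvlink = transfer_time_us(payload, NVLINK_GB_PER_S)
    print(f"PCIe x4: {pcie:.1f} us  NVLink: {nvlink:.1f} us  ratio: {pcie / nvlink:.0f}x")
```

Whatever the real payload is, the ratio between the two links stays the same, and row mode pays that cost at every sync point.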

The Passive Cooling Surprise

Three V100 SXM2 cards with massive server heatsinks running at 50-60W (out of 300W TDP) barely get warm. GPU 1 (sandwiched in the middle) reached 83°C during sustained 6-concurrent benchmarking without a fan, but at normal single-user loads it sits at 55°C.

We briefly tried setting 150W power limits (nvidia-smi -pl 150) so that each adapter could run from a single 8-pin connector, but this caused crashes during model loading. The initial model load appears to need burst power above 150W, even though steady-state inference draws only about 50W.
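One way to confirm the load-time burst would be to sample power draw while the model loads. The sketch below assumes nvidia-smi is on the PATH and uses its real --query-gpu fields (index, power.draw, temperature.gpu). The helper names are made up for this example, and it assumes every field comes back numeric, which may not hold on all cards.

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,power.draw,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse 'index, power, temp' rows into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, power_w, temp_c = [field.strip() for field in line.split(",")]
        stats.append({"index": int(index), "power_w": float(power_w), "temp_c": int(temp_c)})
    return stats


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    out = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


def over_limit(stats: list[dict], limit_w: float) -> list[int]:
    """Return the indices of GPUs drawing more than limit_w watts."""
    return [gpu["index"] for gpu in stats if gpu["power_w"] > limit_w]
```

Calling read_gpu_stats() in a tight loop during a model load and passing each result to over_limit(stats, 150.0) would show whether any card spikes past the cap. nvidia-smi polling is coarse, so very short bursts could still slip between samples.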

The Winner: Q8 Quality Mode

After testing 11 service profiles, the winner is simple:

The benchmark surprised us: 55 t/s total at 3-4 concurrent requests, up from 35 t/s in earlier runs. The most likely reason is that we had stopped the CPU llama-server services that were competing for system resources.

Coil Whine Detective Work

Fun debugging session: inference made weird buzzing sounds. We isolated it by stress-testing each GPU individually (no sound), then running multi-GPU inference (sound only with MTP). The speculative decode cycle's rapid power oscillations vibrate the inductors on the SXM2 adapter boards. No MTP = smooth power = silence.

Boot Chain Fixed

Also fixed the nouveau race condition permanently by adding nvidia to initramfs MODULES. Before this fix, every reboot required manual rmmod nouveau && modprobe nvidia. Now nvidia wins the module loading race at the earliest boot stage.
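A small post-boot sanity check for the fix above, sketched under the assumption that the machine exposes the usual Linux /proc/modules file. It only reports which driver ended up loaded; the function names are invented for this example, and it does not touch the initramfs configuration itself.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set[str]:
    """Return module names from /proc/modules-style text (first column)."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def driver_state(proc_modules_text: str) -> dict[str, bool]:
    """Report whether nvidia and nouveau are currently loaded."""
    mods = loaded_modules(proc_modules_text)
    return {"nvidia": "nvidia" in mods, "nouveau": "nouveau" in mods}


def nvidia_won_the_race(proc_modules_path: str = "/proc/modules") -> bool:
    """True if nvidia is loaded and nouveau is not."""
    state = driver_state(Path(proc_modules_path).read_text())
    return state["nvidia"] and not state["nouveau"]
```

Running nvidia_won_the_race() right after a reboot should return True if the fix holds, with no manual rmmod needed.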

The Splitter Verdict

Tested bifurcation one more time — x8+x8 split, plugged two V100s into the splitter. Second slot still dead (Width x0). Confirmed: the splitter only routes lanes 0-7. It's hardware junk, not a BIOS issue. Direct PCIe slots are the only way.
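For anyone repeating the test, a sketch of how the link width could be pulled out automatically. It assumes the "Width x0" reading comes from the LnkSta line of lspci -vv output. The regex and function names are our own invention, and the exact lspci formatting can vary between versions.

```python
import re

# Matches e.g. "LnkSta: Speed 8GT/s (ok), Width x8 (ok)".
LNKSTA_RE = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s[^,]*,\s*Width\s+x(\d+)")


def parse_link_status(lspci_text: str) -> list[tuple[float, int]]:
    """Return (speed in GT/s, lane width) for every LnkSta line found."""
    return [(float(speed), int(width)) for speed, width in LNKSTA_RE.findall(lspci_text)]


def dead_links(lspci_text: str) -> list[int]:
    """Return positions (in order of appearance) of links with width x0."""
    return [i for i, (_, width) in enumerate(parse_link_status(lspci_text)) if width == 0]
```

Feeding it the output of lspci -vv for the two GPU slots would flag the dead second slot without having to read the dump by eye.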

Tomorrow

This machine goes to the office with SUEZ WiFi pre-configured and frp disabled. Boots up automatically with nvidia driver + Q8 Quality llama-server. No intervention needed.

Three V100s, ¥2,160, 48GB VRAM, 55 t/s throughput. Not bad for a Sunday evening project.