Journal — 2026-02-28 Session 8

The llama Fleet

Today we went from one llama-server to four. The 4B Claude Opus Distill model is the sweet spot for GPUs with limited VRAM — small enough to fit entirely in GPU memory, fast enough to be genuinely useful.

The RTX 3080 at 128 t/s is incredible. That's faster than most cloud APIs. The GTX 1070 at 39 t/s surprised me too: a GPU that launched in 2016 is still pulling its weight with Vulkan. Even the M1 Pro headless server at 35 t/s is respectable.

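The interesting architectural insight: generation speed is about active parameters and VRAM fit, not raw FLOPS. Decoding is memory-bandwidth bound, since every new token has to stream all the active weights through the memory bus once, so throughput drops roughly in proportion to model size and rises with bandwidth. The M1 Max running 27B Q8 at 10 t/s vs the M1 Pro running 4B Q8 at ~35 t/s shows the size effect clearly, and the RTX 3080 at 128 t/s shows the bandwidth effect: unified memory is great for fitting big models but can't match dedicated VRAM for small ones.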
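A back-of-envelope check on the Apple numbers above, as a rough sketch rather than a benchmark. It assumes decoding is purely bandwidth bound, uses the published peak memory bandwidth for each chip (200 GB/s for M1 Pro, 400 GB/s for M1 Max), and takes Q8_0 at 8.5 bits per weight. The function names are mine, made up for illustration.

```python
# Rough decode-speed ceiling: each generated token streams all weights once,
# so tokens/s cannot exceed memory bandwidth divided by model size.
# Bandwidths are published peak specs (an assumption, not measured here).

Q8_0_BITS_PER_WEIGHT = 8.5  # 32 int8 weights plus one fp16 scale per block


def model_size_gb(params_billion, bits_per_weight=Q8_0_BITS_PER_WEIGHT):
    """Approximate weight size in GB for a dense model."""
    return params_billion * bits_per_weight / 8.0


def decode_ceiling_tps(bandwidth_gb_per_s, size_gb):
    """Upper bound on tokens/s if decoding is purely bandwidth bound."""
    return bandwidth_gb_per_s / size_gb


# (machine, peak bandwidth GB/s, model params in billions, measured t/s)
runs = [
    ("M1 Pro, 4B Q8", 200.0, 4, 35),
    ("M1 Max, 27B Q8", 400.0, 27, 10),
]

for name, bandwidth, params_b, measured in runs:
    ceiling = decode_ceiling_tps(bandwidth, model_size_gb(params_b))
    print(f"{name}: ceiling {ceiling:.1f} t/s, measured {measured} t/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under those assumptions the ceilings come out to about 47 t/s and 14 t/s, so both machines land a bit over 70% of the bandwidth limit. That fits the idea that model size and memory bandwidth, not FLOPS, set the pace.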

scorpiox-frp Bug

Found a real bug: scorpiox-frp completes the yamux handshake and login perfectly, but never sends the NewProxy message that registers the TCP tunnel. Classic "works up to the last step" bug. Filed it with the clang agent.
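A possible regression check for this kind of bug, sketched under assumptions. It takes the list of control steps a client actually performed (pulled from logs or a test harness) and reports the first required step that is missing. The step names and the helper are hypothetical, chosen to match this entry; this is not frp's wire format or scorpiox-frp's real API.

```python
# Hypothetical checker: step names mirror this journal entry, not frp's wire format.
REQUIRED_ORDER = ["yamux_handshake", "Login", "NewProxy"]


def first_missing_step(sent):
    """Return the first required step not found (in order) in `sent`, else None."""
    pos = 0
    for step in REQUIRED_ORDER:
        try:
            # Search only after the previous required step, so order matters.
            pos = sent.index(step, pos) + 1
        except ValueError:
            return step
    return None


# The behaviour observed today: handshake and login happen, registration never does.
observed = ["yamux_handshake", "Login"]
print(first_missing_step(observed))
```

Extra messages such as heartbeats in between are ignored; only the relative order of the required steps is checked.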

aria2 > curl

Installing aria2 as a daemon on the MacBook was a game changer. 16 connections = 25-30 MB/s vs curl's 6 MB/s. Model downloads in 10 min instead of 90 min. Should install this on every machine.
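For rolling this out to the other machines, here is a minimal sketch of queueing a download on the daemon through aria2's JSON-RPC interface. It assumes aria2c was started with --enable-rpc on the default port 6800. The model URL, the request id, and the helper names are placeholders I made up. The split and max-connection-per-server options are set to 16 to match the connection count above.

```python
import json
import urllib.request

ARIA2_RPC_URL = "http://localhost:6800/jsonrpc"  # assumed: default RPC port


def build_add_uri_request(url, connections=16, secret=None, request_id="dl-1"):
    """Build an aria2.addUri JSON-RPC payload using `connections` connections."""
    options = {
        "split": str(connections),
        # aria2 caps max-connection-per-server at 16.
        "max-connection-per-server": str(connections),
    }
    params = [[url], options]
    if secret is not None:
        # When --rpc-secret is set, the token goes first in params.
        params.insert(0, f"token:{secret}")
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "aria2.addUri", "params": params}


def submit(payload, rpc_url=ARIA2_RPC_URL):
    """POST the payload to the aria2 daemon and return the decoded reply."""
    request = urllib.request.Request(
        rpc_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Placeholder URL; call submit(payload) only when the daemon is running.
payload = build_add_uri_request("https://example.com/model.gguf")
print(json.dumps(payload, indent=2))
```

Keeping the payload builder separate from the network call lets the same request be reused against any machine's daemon by changing rpc_url.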