2026-03-18 Evening — Infrastructure Simplification Day
Big simplification today. Killed the Azure frp proxy dependency for sonnet and haiku. Both now go direct: public DNS → home IP → NAS Caddy → ryzen .6. No more tunnel middleman, no more frps-secure, no more scorpiox-frp C bugs to worry about.
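Quick sanity check for the direct path, as a rough sketch: confirm the public name resolves to the home IP before trusting the new route. `resolves_to` is a made-up helper, and the hostname and IP in the usage comment are placeholders, not the real ones. The resolver is injectable so the check can be exercised offline.

```python
import socket

def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return True if hostname resolves (IPv4) to expected_ip."""
    try:
        return resolver(hostname) == expected_ip
    except OSError:
        # socket.gaierror subclasses OSError; a failed lookup counts as a mismatch.
        return False

# Usage (placeholder values, substitute the real endpoint name and home IP):
#   resolves_to("sonnet.example.net", "203.0.113.10")
```

This only covers the first hop (public DNS to home IP); the Caddy to ryzen .6 leg would still need an HTTP-level check.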
The scorpiox-frp C implementation is still broken: the TLS handshake with frps-secure fails (mbedTLS error), and yamux is broken without TLS. Doesn't matter anymore, since these endpoints no longer need it. Go frpc on macmini still works fine for claude.scorpiox.net.
Migrated all cuda VM services to .3 (whisper) and .6 (llama). cuda VM is now fully decommissioned — shut off, services moved, DNS updated, Caddy cleaned. Everything that referenced 192.168.1.70 is gone.
Built llama.cpp with CUDA on Arch. The RTX 3080 was already passed through to Arch this whole time; I'd been documenting it as a GTX 1070 (that card is actually in the gaming VM). The 4B model runs at 75 t/s on CUDA, which is great. 9B Q8 doesn't fit in 10 GB of VRAM though: it needs partial offload and only gets 8.6 t/s, barely better than the ryzen CPU.
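Back-of-envelope for why 9B Q8 misses on a 10 GB card, as a rough sketch. It assumes "9B" means exactly 9e9 parameters and that every tensor is stored as Q8_0 (32 int8 weights plus one fp16 scale per block, so 34 bytes per 32 weights); real GGUF files keep a few tensors in higher precision, so treat the result as a lower bound.

```python
GIB = 1024 ** 3

def q8_0_weight_bytes(n_params):
    """Q8_0 block: 32 int8 weights + one fp16 scale = 34 bytes per 32 weights."""
    return n_params * 34 / 32

N_PARAMS = 9e9   # assumption: "9B" taken as exactly 9e9 parameters
VRAM_GIB = 10    # RTX 3080 10 GB card

weights_gib = q8_0_weight_bytes(N_PARAMS) / GIB
headroom_gib = VRAM_GIB - weights_gib

print(f"weights ~{weights_gib:.1f} GiB, headroom ~{headroom_gib:.1f} GiB")
```

Weights alone come to roughly 8.9 GiB, leaving about 1.1 GiB for the KV cache, compute buffers, and CUDA context, which is consistent with needing partial offload.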
Tried faster-whisper on Arch with CUDA — CTranslate2 installed fine, CUDA compute types all available, but model loading hangs. Python 3.14 (Arch bleeding edge) is the culprit. Works perfectly on .3's Python 3.10. Not worth fighting — whisper stays on .3 for now.
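A small guard could make the version problem fail fast instead of hanging at model load. This is only a sketch: `whisper_python_status` is a hypothetical helper, and the two version sets just encode what was observed here (3.10 works on .3, 3.14 hangs on Arch). Everything else is untested, not known-good.

```python
import sys

KNOWN_GOOD = {(3, 10)}  # observed working on .3
KNOWN_BAD = {(3, 14)}   # observed: model loading hangs on Arch

def whisper_python_status(version=None):
    """Classify an interpreter version as 'ok', 'known-bad', or 'untested'."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    if major_minor in KNOWN_GOOD:
        return "ok"
    if major_minor in KNOWN_BAD:
        return "known-bad"
    return "untested"
```

Calling this before importing the whisper stack and bailing out on "known-bad" would turn a silent hang into an immediate, readable error.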
Cleaned up 655 GB of old/duplicate models across the fleet. Three copies of 35B BF16 on .3 alone (195 GB wasted). Removed all non-Opus distills, the old Qwen3-14B, and the 122B base models. NVMe went from 51 GB free to 86 GB free.
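For next time, duplicate model files could be found mechanically rather than by eye. A minimal sketch, not the method used today: group files by size first (cheap), then confirm with SHA-256 only within same-size groups. `find_duplicates` and the suffix list are assumptions for illustration.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large models don't fill RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root, suffixes=(".gguf", ".safetensors")):
    """Return groups (sorted lists of paths) of byte-identical model files."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(suffixes):
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Hashing multi-gigabyte files is slow, so the size pre-filter matters; this only catches exact copies on one machine, not re-quantized or renamed variants across the fleet.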
The user is thinking about dedicating the 3080 to whisper if CUDA whisper works. Smart split: ryzen for LLM, arch for speech. But the Python version is blocking it for now. Could try a container or downgrade Python.