Stable run on 2x RTX 5090 (32GB) with ik_llama.cpp - 6.1 t/s
First of all, a huge thank you to Ubergarm for this high-quality IQ4_K quantization. It works beautifully!
I managed to find the "sweet spot" for running this massive GLM-4.7 MoE model on a dual consumer GPU setup. Here are my results and the specific configuration to maximize VRAM usage without OOM crashes.
Hardware Configuration:
GPUs: 2x NVIDIA RTX 5090 (32GB VRAM each)
CPU: 2x Xeon E5-2696 v4 (without AVX-512), CUDA 12.8, NVIDIA driver 580.65.06
RAM: 400 GB DDR4 (LXC container on Proxmox VE 9)
Software: ik_llama.cpp with https://github.com/ikawrakow/ik_llama.cpp/pull/1080
Performance:
Generation Speed: ~6.1 t/s
Prompt Processing: ~16.4 t/s
VRAM Usage: ~31GB per card (95% utilization, rock solid)
Since the full model doesn't fit in 64GB of VRAM, I used manual tensor overrides (`-ot`) to force exactly 17 layers of experts onto the GPUs (10 on GPU0, 7 on GPU1), while keeping the rest in CPU RAM. The KV cache is quantized to Q4_0 to save space.
```bash
# Experts for layers 0-9   -> CUDA0 (~21GB)
# Experts for layers 10-16 -> CUDA1 (~15GB + overhead)
# All remaining experts (layers 17-92) -> CPU
numactl --interleave=all ./build/bin/llama-server \
    --model GLM-4.7-IQ4_K-00001-of-00006.gguf \
    --ctx-size 131072 \
    --threads 64 --threads-batch 64 \
    --n-gpu-layers 99 \
    --tensor-split 0.5,0.5 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    -ot 'blk\.[0-9]\..*exps\.weight=CUDA0' \
    -ot 'blk\.1[0-6]\..*exps\.weight=CUDA1' \
    -ot '.*exps\.weight=CPU'
```
If you have any more ideas for getting better results, I'd love to hear them.
Glad you're getting some success! Tuning your exact parameters for your rig is part of the fun these days haha...
My initial thoughts are:
- Go ahead and build from the tip of main now, as that PR is merged up, so you'll get all the latest goodies as they arrive (there's a build sketch after this list).
- If you're going all the way down to q4_0 for the KV cache, consider trying the Hadamard k-cache stuff from these PRs:
  - https://github.com/ikawrakow/ik_llama.cpp/pull/1033
  - https://github.com/ikawrakow/ik_llama.cpp/pull/1034
  - basically try maybe `--k-cache-hadamard -ctk q4_0 -ctv q5_0` or similar for fun
- If you can run bare metal instead of through Proxmox, it might help.
- Your NUMA situation is gonna be one of the biggest considerations, e.g. BIOS config etc. Too much to list here and there is a lot of chatter on it, but you can try adding `--numa numactl` to your command too.
- Those CPUs have only 22 physical cores each (pretty sure), so likely try `--threads 44 --threads-batch 44` or play with the numbers. Though on Intel it can be CPU limited and maybe the SMT does help, so you'll have to benchmark across a bunch of values to see (there's a thread-sweep sketch after this list).
- Consider using `--n-cpu-moe 72 -ts 30,32` or similar, as the offload strategy can sometimes make a difference with slower PCIe speeds (or change your 10-16 experts to be the last ~8 layers for CUDA1). There is a PR discussion about that between ik and me, can't find it right now lol. (72 should be the total number of layers, like 91ish, minus however many you want on GPU.) There's an alternative launch sketch after this list too.
- Consider benchmarking with `llama-sweep-bench` and making graphs, or having some way to compare all your benchmark runs across the desired context length.
- You'll probably not use that much context at those speeds, so consider dropping context down to like 65k and increasing batch sizes, e.g. `-ub 4096 -b 4096` for sure.
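For the "build from tip of main" item, here's a minimal sketch of a fresh CUDA build of ik_llama.cpp. The exact CMake flags are assumptions about a typical setup, so adjust for your toolchain:

```bash
# Minimal sketch of a fresh CUDA build from the tip of main.
# The CMake flags are assumptions for a typical setup; adjust for your toolchain.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j "$(nproc)"
# Binaries such as llama-server and llama-sweep-bench land in ./build/bin/
```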
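For the `--n-cpu-moe` / NUMA / batch-size items, here's one way the suggestions could combine into a single launch. Treat it as a sketch: the `--n-cpu-moe 72`, `-ts 30,32`, 65k context, and 4096 batch sizes are just the starting values mentioned in this thread, not tuned numbers:

```bash
# Sketch of an alternative offload strategy (values are starting points from this
# thread, not tuned results): --n-cpu-moe / -ts instead of hand-written -ot regexes,
# --numa numactl for the dual-socket board, smaller context and bigger batches.
./build/bin/llama-server \
    --model GLM-4.7-IQ4_K-00001-of-00006.gguf \
    --ctx-size 65536 \
    -ub 4096 -b 4096 \
    --threads 44 --threads-batch 44 \
    --numa numactl \
    --n-gpu-layers 99 \
    --n-cpu-moe 72 \
    -ts 30,32 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0
```

The appeal is that bumping a single `--n-cpu-moe` number (and nudging `-ts`) is much quicker to iterate on than rewriting the `-ot` regexes every time.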
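And for finding the right thread count, a quick-and-dirty approach is to loop `llama-sweep-bench` over a few values and keep the logs. This assumes sweep-bench accepts the usual common flags (model, context, threads, batch sizes, `-ot`), and the thread values below are only example points, not recommendations:

```bash
# Sketch: sweep a few thread counts with llama-sweep-bench and keep the logs so the
# pp/tg numbers can be compared across context depth. The thread values are just
# example points (per-socket cores, all physical cores, OP's 64, all SMT threads).
for t in 22 44 64 88; do
    ./build/bin/llama-sweep-bench \
        --model GLM-4.7-IQ4_K-00001-of-00006.gguf \
        --ctx-size 65536 \
        -ub 4096 -b 4096 \
        --threads "$t" --threads-batch "$t" \
        --n-gpu-layers 99 \
        -ot 'blk\.[0-9]\..*exps\.weight=CUDA0' \
        -ot 'blk\.1[0-6]\..*exps\.weight=CUDA1' \
        -ot '.*exps\.weight=CPU' \
        | tee "sweep_t${t}.log"
done
```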
That's enough to keep you busy for the rest of the year! ;p Have fun!