Optimized Quant between 3bpw and 3.5bpw
Hi, just want to shout out that this is awesome work. Would it be possible for you to upload an optimized quant at around 3.25bpw for the GPU-poor folk? (6x3090)
Of course. Do you have specific requirements for context length as well? Otherwise I'll try to reach 200K context at k5v4 quality.
Though 6*24GB = 144GB, so there would be barely any space left for context given that 3bpw is already ~124GB. Anyway, I'll try to cook something good for 6x24GB.
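For a rough sense of the budget, here's a back-of-the-envelope sketch (all figures are approximations; the per-GPU overhead in particular is a guess, not a measurement):

```python
# Back-of-the-envelope VRAM budget for 6x3090 (all figures approximate).
total_gib = 6 * 24        # 144 GiB of raw VRAM
weights_gib = 124         # ~3bpw quant of the weights
overhead_gib = 6 * 1.0    # per-GPU CUDA context / framework buffers (a guess)

cache_gib = total_gib - weights_gib - overhead_gib
print(f"~{cache_gib:.0f} GiB left for KV cache and activations")  # ~14 GiB
```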
Thanks for the response, you are right, it would be pushing it. I'm not ambitious about full context length; around 128K context is already very good for me. At minimum, I would like at least 64K. I do have 1x5090 + 5x3090, so 8GB more for the context.
Any chance you could do a 2.25-2.33bpw quant, Mr. ratsim?
With GLM4.6 I was using turboderp/GLM-4.6-exl3-2.33bpw-opt.
It fits well into my 4090+6000 Blackwell setup.
Also, incredible job. I'm using your 2bpw quant and can squeeze in around 60% more context than a GGUF variant.
I've added a 2.10bpw quant with 131072 context that fits in 96GB. I'm quite pleased at how usable it is.
Currently quanting 5bpw and will continue optimized quants afterwards.
It fits even though the file size is larger?
k5v4 cache quantization should allow you to reach 131072 context, though 4x tensor parallelism might add extra bookkeeping overhead per GPU.
In TabbyAPI's config.yml:

```yaml
# Enable different cache modes for VRAM savings (default: FP16).
# Possible values for exllamav2: 'FP16', 'Q8', 'Q6', 'Q4'.
# For exllamav3, specify the pair k_bits,v_bits where k_bits and v_bits are integers from 2-8 (i.e. 8,8).
cache_mode: 5,4
```
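For sizing, here's a rough sketch of how the k_bits,v_bits pair translates into KV-cache VRAM. The layer/head/dim numbers below are placeholders rather than the model's actual values (read them from its config.json), and the quantized cache's scale overhead is ignored, so treat the output as ballpark only:

```python
# Approximate KV-cache footprint for a k_bits,v_bits cache_mode pair.
# The architecture numbers are placeholders; take num_hidden_layers,
# num_key_value_heads and head_dim from the model's config.json.
# Per-block scale/zero-point overhead of the quantized cache is ignored.
def kv_bytes_per_token(k_bits, v_bits, layers=92, kv_heads=8, head_dim=128):
    per_layer = kv_heads * head_dim * (k_bits + v_bits) / 8  # bytes per token per layer
    return layers * per_layer

for k, v in [(8, 8), (6, 5), (5, 4), (4, 3)]:
    per_tok = kv_bytes_per_token(k, v)
    gib = per_tok * 131072 / 1024**3
    print(f"k{k}v{v}: {per_tok / 1024:.0f} KiB/token, ~{gib:.1f} GiB at 131072 context")
```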
Huh.. so we can set separate K and V now, TIL
Thanks, let me test 3.15bpw. I was testing with 3.0bpw and it is coherent. Btw, do you have plans to quant MiniMax M2.1? It just came out and it's neck and neck with GLM.
> Huh.. so we can set separate K and V now, TIL

Yes, exl3 gives a lot of options for tweaking.
Is K/V the same as in llama.cpp, where the key should be quantized less aggressively than the value? Going to try to squeeze in at least 32K.
From https://github.com/turboderp-org/exllamav3/issues/1#issuecomment-2826132438 it seems like it's better to reduce V before reducing K.
> Btw, do you have plans to quant MiniMax M2.1? It just came out and it's neck and neck with GLM.

For now I don't. While I'm a dev, AI doesn't really help me with my work (Rust-based state-of-the-art cryptography, i.e. implementing papers that are a few months old, and compiler engineering) or with my hobby projects (in Nim: cryptography, compilers, high-performance computing and deep learning), except for table stakes like documentation.
And for the rest of my use cases, I need a general-purpose model with strong French and biomedical capabilities, able to parse French legal jargon, with excellent general knowledge and pop culture, and able to write stories and scenarios.
Another model I'm interested in is MiMo-V2-Flash, but it can't be supported in ExllamaV3 at the moment: https://github.com/turboderp-org/exllamav3/issues/124
@zz2g I've created a 2.57bpw quant (105GiB) that should allow a decent 102400 context size on 120GiB VRAM (24+96) with k5v4 KV-cache quant.
Tell me if this works for you or if I need something even smaller so you can fit 200K context.
I'll give it a try! Thanks a lot! And also thanks to Turboderp for exl3 and Z.AI for releasing a model on par with the best closed models to the public!
Will report my findings!
Btw, what you reported is a complaint I've read about before for GLM4.7. It has a particular kind of censorship built in to counteract what it believes are jailbreaks.
You might have just gotten unlucky and hit that particular spot within its latents.
Ahh ok.. so it behaves just like llama.cpp cache quantization: higher-precision key, lower-precision value. Must be a universal thing.
For the 2.10bpw quant?
Are you running Xorg/Wayland on those cards as well?
It might be that the overhead of NCCL is greater than I imagined on 4 cards. I tested on a single 96GiB RTX pro 6000 and I could load the quant.
Ok, tested the 2.57bpw quant, I'm just amazed how good this model is. It can handle 10 characters at the same time, I'm flabbergasted.
Somehow the current cook I have has refusal and jailbreak checks, so I'm postponing it.
Hey mate, any theories about this one? I'm noticing the same thing with certain <4bpw gguf quants.
These two: ubergarm/GLM-4.7-GGUF IQ3_KS and unsloth/GLM-4.6-GGUF UD-Q3_K_XL specifically.
I'm trying to tweak a compassion-vs-sadism control vector (trying to get prompt stems that don't amplify slop) for this model, and I tend to use smaller quants to experiment since I have to generate 15360 samples for each tweak I make.
BUT with this model, the <4bpw quants seem to reason about safety, or simply refuse with thinking disabled.
IQ3_KS: (attachment not shown) Normally I'd expect the max Ξ to be around layer 45 for this vector.
Q8_0: (attachment not shown)
And an example ridiculous refusal: (attachment not shown)
No idea unfortunately. Are you sure about GLM4.6 though or is it a typo? That's the first time I'm hearing about refusals for it.
My cards are empty. No Xorg. There is another 250MB process for NCCL and native TP. As for censorship... I have not run into much using text completion. Yeah, the model is more gentle but it still does NSFW and gore. If a refusal comes up, I regenerate and it usually will bypass.
> Are you sure about GLM4.6 though or is it a typo?

You're right, that was a typo, I meant their 4.7 -_-!
Anyway, it's not really related to this EXL3 quant; I just thought you might have found something, since one of your smaller quants amplified refusals.
I've never seen this happen before with any other models.
> If a refusal comes up, I regenerate and it usually will bypass.

The refusals seem quant related for me. I can stick with q8_0 for now, it'll just be slower.
> using text completion

Yeah, my datasets are pre-formatted, so it's equivalent to text completion.








