Optimized Quant between 3bpw and 3.5bpw

by remichu - opened

Hi, just want to shout out that this is awesome work. Would it be possible for you to upload an optimized quant at around 3.25bpw for the GPU-poor folk? (6x3090)

Of course. Do you have specific requirements for context length as well? Otherwise I'll try to reach 200K context at k5v4 quality.

Though 6*24GB = 144GB, so there would be no space left for context given that 3bpw is already ~124GB. Anyway, I'll try to cook something good for 6x24GB.

Thanks for the response, you are right, it will be pushing it. I am not aiming for full context length; around 128K context is already very good for me, and at minimum I would like to have 64K. I do have 1x5090 + 5x3090, so 8GB more for the context.

Any chance you could do a 2.25-2.33 bpw quant mr ratsim?
With GLM4.6 I was using turboderp/GLM-4.6-exl3-2.33bpw-opt.
It fits well into my 4090+6000 Blackwell setup.
Also, incredible job. I'm using your 2bpw quant and can squeeze in around 60% more context than a GGUF variant.

I've added a 2.10bpw quant with 131072 context that fits in 96GB. I'm quite pleased at how usable it is.

Currently quanting 5bpw and will continue optimized quants afterwards.

It fits even though the file size is larger?

It fits even though the file size is larger?

The file size on HuggingFace is reported in GB but VRAM is measured in GiB; the difference is a factor of 1024x1024x1024/(1000x1000x1000) ≈ 1.0737x
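For example, a hypothetical repo listed as 94.4 GB on HuggingFace holds only 94.4 / 1.0737 ≈ 87.9 GiB of actual data, and GiB is the unit to compare against VRAM.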


@Lockout I just tested the 2.10bpw-tuned briefly on 4x3090 at 16384 ctx with Q6 KV cache, and it fits.
ncdu reports 87.9 GiB on the SSD.

k5v4 quants should allow you to reach 131072 context, though 4x tensor parallelism might add extra bookkeeping overhead per GPU.

In TabbyAPI's config.yml

# Enable different cache modes for VRAM savings (default: FP16).
# Possible values for exllamav2: 'FP16', 'Q8', 'Q6', 'Q4'.
# For exllamav3, specify the pair k_bits,v_bits where k_bits and v_bits are integers from 2-8 (i.e. 8,8).
cache_mode: 5,4
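
As a rough back-of-the-envelope estimate (ignoring any scale/metadata overhead the quantized cache also stores): 5,4 keeps 5 + 4 = 9 bits per K/V element pair versus 16 + 16 = 32 bits for FP16, i.e. roughly 28% of the FP16 cache footprint, which is what leaves room for six-figure context sizes after the weights.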

@remichu I've added a 3.15bpw which should allow 102400 context size (maybe less, unsure about the overhead on 6x GPUs).

In the process I somehow cooked a Qwen-like quant that kept saying "Wait" in its reasoning trace for coding, and kept changing its mind in creative writing:

[screenshots: the repetitive "Wait" reasoning trace and a creative-writing sample where it keeps changing its mind]

@zz2g, I'm currently trying to cook a 2.62bpw model (108GiB, allowing 131K context on 128GiB systems and about 100K on 24+96GiB systems like yours), but somehow the current cook has refusal and jailbreak checks, so I'm postponing it.

[screenshot: example of the refusal / jailbreak-check behavior]

Huh.. so we can set separate K and V now, TIL

Thanks, let me test the 3.15bpw. I was testing with 3.0bpw and it is coherent. Btw, do you have plans to quant MiniMax M2.1? It just came out and it is neck and neck with GLM.

Huh.. so we can set separate K and V now, TIL

Yes, exl3 gives a lot of options for tweaking.

@zz2g I've created a 2.57bpw quant (105GiB) that should allow a decent 102400 context size on 120GiB VRAM (24+96) with k5v4 KV-cache quant.

Tell me if this works for you or if I need something even smaller so you can fit 200K context.

Is K/V the same as in llama.cpp, where the key should be quantized less aggressively than the value? Going to try to squeeze in at least 32k.

From https://github.com/turboderp-org/exllamav3/issues/1#issuecomment-2826132438 it seems like it's better to reduce v before reducing k

@remichu

Btw, do you have plans to quant MiniMax M2.1? It just came out and it is neck and neck with GLM.

For now I don't. While I'm a dev, AI does not really help me, either for work (Rust-based state-of-the-art cryptography, i.e. implementing papers that are only a few months old, and compiler engineering) or for hobby projects (in Nim: cryptography, compilers, high-performance computing and deep learning), except for table stakes like documentation.

And for the rest of my use cases, I need a general-purpose model with strong French and bio-medical capabilities, able to parse French legal jargon, with excellent general knowledge and pop culture, and able to write stories and scenarios.

Another model I'm interested in is MiMo-V2-Flash but it cannot be supported in ExllamaV3 at the moment: https://github.com/turboderp-org/exllamav3/issues/124

@zz2g I've created a 2.57bpw quant (105GiB) that should allow a decent 102400 context size on 120GiB VRAM (24+96) with k5v4 KV-cache quant.

Tell me if this works for you or if I need something even smaller so you can fit 200K context.

I'll give it a try! Thanks a lot! And also thanks to Turboderp for exl3 and Z.AI for releasing a model on par with the best closed models to the public!
Will report my findings!
Btw, what you reported is a complaint I have read about before for GLM4.7. It has particular censorship built in to counteract what it believes are jailbreaks.
You might just have gotten unlucky and hit that particular spot within its latents.

Ahh ok.. so it behaves just like llama.cpp cache quantization: higher-precision key, lower-precision value. Must be universal.

Unfortunately with NCCL, 32k barely fits on 96GB. Maybe native TP is better. Even with the cache quantized.

[screenshot: VRAM usage with NCCL]

Native TP: not much difference.

[screenshot: VRAM usage with native TP]

gpu_split: [22.5,23,23,23] #GLM
and export PYTORCH_ALLOC_CONF=expandable_segments:True

chunk_size: 1024 helps prevent OOM but no way it will fit 100k.
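
For anyone else trying to squeeze this onto 4x24GB, a rough sketch of how those settings fit together, using the keys named in this thread plus max_seq_len (not from this thread); exact placement inside config.yml may differ between TabbyAPI versions, and the allocator variable is set in the shell before launching, not in the YAML:

# In the shell, before starting TabbyAPI
export PYTORCH_ALLOC_CONF=expandable_segments:True

# In config.yml
gpu_split: [22.5, 23, 23, 23]   # per-GPU split from the post above
max_seq_len: 32768              # context length the KV cache is reserved for
cache_mode: 5,4                 # k_bits,v_bits KV-cache quantization for exllamav3
chunk_size: 1024                # smaller prompt-processing chunks reduce peak VRAM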

For the 2.10bpw quant?

Are you running Xorg/Wayland on those cards as well?

It might be that the overhead of NCCL is greater than I imagined on 4 cards. I tested on a single 96GiB RTX pro 6000 and I could load the quant.

Ok, tested the 2.57bpw quant, I'm just amazed how good this model is. It can handle 10 characters at the same time, I'm flabbergasted.

@mratsim

somehow the current cook has refusal and jailbreak checks, so I'm postponing it

Hey mate, any theories about this one? I'm noticing the same thing with certain <4bpw GGUF quants.

These two: ubergarm/GLM-4.7-GGUF IQ3_KS and unsloth/GLM-4.6-GGUF UD-Q3_K_XL specifically.

I'm trying to tweak a compassion-vs-sadism control vector (trying to get prompt stems that don't amplify slop) for this model, and tend to use smaller quants to experiment since I have to generate 15360 samples for each tweak I make.

BUT with this model, the <4bpw quants seem to reason about safety, or simply refuse with thinking disabled.

IQ3_KS:

[screenshot: per-layer Ξ” output for IQ3_KS]

Normally I'd expect the Max Ξ” to be around layer 45 for this vector.

Q8_0:

[screenshot: per-layer Ξ” output for Q8_0]

And an example of a ridiculous refusal:

[screenshot: example refusal]

No idea unfortunately. Are you sure about GLM4.6 though or is it a typo? That's the first time I'm hearing about refusals for it.

My cards are empty. No Xorg. There is another 250MB process for NCCL and native TP. As to censorship.. I have not run into much using text completion. Yeah, the model is more gentle but it still does NSFW and gore. If a refusal comes up, I regenerate and it usually bypasses it.

Are you sure about GLM4.6 though or is it a typo?

You're right, that was a typo, I meant their 4.7 -_-!

Anyway, not really related to this EXL3 quant; I just thought you might have found something, since one of your smaller quants amplified refusals.
I've never seen this happen before with any other models.

If a refusal comes up, I regenerate and it usually bypasses it.

The refusals seem quant-related for me. I can stick with Q8_0 for now, it'll just be slower.

using text completion

Yeah, my datasets are pre-formatted, so it's equivalent to text completion.
