Optimized Quant between 3bpw and 3.5bpw

by remichu - opened

Hi, just want to shout out that this is awesome work. Would it be possible for you to upload an optimized quant at around 3.25bpw for the GPU-poor folk? (6x3090)

Of course. Do you have specific requirements for context length as well? Otherwise I'll try to reach 200K context at k5v4 quality.

Though 6*24GB = 144GB, so there would be no space left for context given that 3bpw is already ~124GB. Anyway, I'll try to cook something good for 6x24GB.

Thanks for the response, you are right, it will be pushing it. I am not aiming for full context length; around 128K context is already very good for me, and at minimum I would like to have 64K. I do have 1x5090 + 5x3090, so 8GB more for the context.

Any chance you could do a 2.25-2.33 bpw quant mr ratsim?
With GLM4.6 I was using turboderp/GLM-4.6-exl3-2.33bpw-opt.
It fits well into my 4090+6000 Blackwell setup.
Also, incredible job. I'm using your 2bpw quant and can squeeze in around 60% more context than a GGUF variant.

I've added a 2.10bpw quant with 131072 context that fits in 96GB. I'm quite pleased at how usable it is.

Currently quanting 5bpw and will continue optimized quants afterwards.

It fits even though the file size is larger?

It fits even though the file size is larger?

The file size on HuggingFace is reported in GB but VRAM is measured in GiB; the difference is a factor of 1024x1024x1024/(1000x1000x1000) ≈ 1.0737x
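For example, a hypothetical repo listed as 94.4 GB on HuggingFace holds only 94.4 / 1.0737 ≈ 87.9 GiB of actual data, and GiB is the unit to compare against VRAM.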


@Lockout I just tested the 2.10bpw-tuned briefly on 4x3090 at 16384 ctx with Q6 KV cache, and it fits.
ncdu reports 87.9 GiB on the SSD.

k5v4 quants should allow you to reach 131072 context, though 4x tensor parallelism might add extra bookkeeping overhead per GPU.

In TabbyAPI's config.yml

# Enable different cache modes for VRAM savings (default: FP16).
# Possible values for exllamav2: 'FP16', 'Q8', 'Q6', 'Q4'.
# For exllamav3, specify the pair k_bits,v_bits where k_bits and v_bits are integers from 2-8 (i.e. 8,8).
cache_mode: 5,4
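
As a rough back-of-the-envelope estimate (ignoring any scale/metadata overhead the quantized cache also stores): 5,4 keeps 5 + 4 = 9 bits per K/V element pair versus 16 + 16 = 32 bits for FP16, i.e. roughly 28% of the FP16 cache footprint, which is what leaves room for six-figure context sizes after the weights.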

@remichu I've added a 3.15bpw which should allow 102400 context size (maybe less, unsure about the overhead on 6x GPUs).

In the process I somehow cooked a Qwen-like quant that kept saying "Wait" in its reasoning trace for coding, and kept changing its mind in creative writing:

[screenshots: the repetitive "Wait" reasoning trace and a creative-writing sample where it keeps changing its mind]

@zz2g, I'm currently trying to cook a 2.62bpw model (108GiB, allowing 131K context on 128GiB systems and about 100K on 24+96GiB systems like yours), but somehow the current cook has refusal and jailbreak checks, so I'm postponing it.

[screenshot: example of the refusal / jailbreak-check behavior]

Huh.. so we can set separate K and V now, TIL

Thanks, let me test the 3.15bpw. I was testing with 3.0bpw and it is coherent. Btw, do you have plans to quant MiniMax M2.1? It just came out and it is neck and neck with GLM.

Huh.. so we can set separate K and V now, TIL

Yes, exl3 gives a lot of options for tweaking.

@zz2g I've created a 2.57bpw quant (105GiB) that should allow a decent 102400 context size on 120GiB VRAM (24+96) with k5v4 KV-cache quant.

Tell me if this works for you or if I need something even smaller so you can fit 200K context.

Is K/V the same as in llama.cpp, where the key should be quantized less aggressively than the value? Going to try to squeeze in at least 32k.

From https://github.com/turboderp-org/exllamav3/issues/1#issuecomment-2826132438 it seems like it's better to reduce v before reducing k

@remichu

Btw, do you have plans to quant MiniMax M2.1? It just came out and it is neck and neck with GLM.

For now I don't. While I'm a dev, AI does not really help me, either for work (Rust-based state-of-the-art cryptography, i.e. implementing papers that are only a few months old, and compiler engineering) or for hobby projects (in Nim: cryptography, compilers, high-performance computing and deep learning), except for table stakes like documentation.

And for the rest of my use cases, I need a general-purpose model with strong French and bio-medical capabilities, able to parse French legal jargon, with excellent general knowledge and pop culture, and able to write stories and scenarios.

Another model I'm interested in is MiMo-V2-Flash but it cannot be supported in ExllamaV3 at the moment: https://github.com/turboderp-org/exllamav3/issues/124

@zz2g I've created a 2.57bpw quant (105GiB) that should allow a decent 102400 context size on 120GiB VRAM (24+96) with k5v4 KV-cache quant.

Tell me if this works for you or if I need something even smaller so you can fit 200K context.

I'll give it a try! Thanks a lot! And also thanks to Turboderp for exl3 and Z.AI for releasing a model on par with the best closed models to the public!
Will report my findings!
Btw, what you reported is a complaint I have read about before for GLM4.7. It has particular censorship built in to counteract what it believes are jailbreaks.
You might just have gotten unlucky and hit that particular spot within its latents.

Ahh ok.. so it behaves just like llama.cpp cache quantization: higher-precision key, lower-precision value. Must be universal.

Unfortunately with NCCL, 32k barely fits on 96GB. Maybe native TP is better. Even with the cache quantized.

[screenshot: VRAM usage with NCCL]

Native TP: not much difference.

[screenshot: VRAM usage with native TP]

gpu_split: [22.5,23,23,23] #GLM
and export PYTORCH_ALLOC_CONF=expandable_segments:True

chunk_size: 1024 helps prevent OOM but no way it will fit 100k.
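
For anyone else trying to squeeze this onto 4x24GB, a rough sketch of how those settings fit together, using the keys named in this thread plus max_seq_len (not from this thread); exact placement inside config.yml may differ between TabbyAPI versions, and the allocator variable is set in the shell before launching, not in the YAML:

# In the shell, before starting TabbyAPI
export PYTORCH_ALLOC_CONF=expandable_segments:True

# In config.yml
gpu_split: [22.5, 23, 23, 23]   # per-GPU split from the post above
max_seq_len: 32768              # context length the KV cache is reserved for
cache_mode: 5,4                 # k_bits,v_bits KV-cache quantization for exllamav3
chunk_size: 1024                # smaller prompt-processing chunks reduce peak VRAM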

For the 2.10bpw quant?

Are you running Xorg/Wayland on those cards as well?

It might be that the overhead of NCCL is greater than I imagined on 4 cards. I tested on a single 96GiB RTX pro 6000 and I could load the quant.

Ok, tested the 2.57bpw quant, I'm just amazed how good this model is. It can handle 10 characters at the same time, I'm flabbergasted.

@mratsim

somehow the current cook has refusal and jailbreak checks, so I'm postponing it

Hey mate, any theories about this one? I'm noticing the same thing with certain <4bpw GGUF quants.

These two: ubergarm/GLM-4.7-GGUF IQ3_KS and unsloth/GLM-4.6-GGUF UD-Q3_K_XL specifically.

I'm trying to tweak a compassion-vs-sadism control vector (trying to get prompt stems that don't amplify slop) for this model, and tend to use smaller quants to experiment since I have to generate 15360 samples for each tweak I make.

BUT with this model, the <4bpw quants seem to reason about safety, or simply refuse with thinking disabled.

IQ3_KS:

[screenshot: per-layer Ξ” output for IQ3_KS]

Normally I'd expect the Max Ξ” to be around layer 45 for this vector.

Q8_0:

[screenshot: per-layer Ξ” output for Q8_0]

And an example of a ridiculous refusal:

[screenshot: example refusal]

No idea unfortunately. Are you sure about GLM4.6 though or is it a typo? That's the first time I'm hearing about refusals for it.

My cards are empty. No Xorg. There is another 250MB process for NCCL and native TP. As to censorship.. I have not run into much using text completion. Yeah, the model is more gentle but it still does NSFW and gore. If a refusal comes up, I regenerate and it usually bypasses it.

Are you sure about GLM4.6 though or is it a typo?

You're right, that was a typo, I meant their 4.7 -_-!

Anyway, not really related to this EXL3 quant; I just thought you might have found something, since one of your smaller quants amplified refusals.
I've never seen this happen before with any other models.

If a refusal comes up, I regenerate and it usually bypasses it.

The refusals seem quant-related for me. I can stick with Q8_0 for now, it'll just be slower.

using text completion

Yeah, my datasets are pre-formatted, so it's equivalent to text completion.
