Q6_K barely smaller than Q8_0?
This is expected behavior for any GPT-OSS-based model, as the original model was trained and published in MXFP4, which llama.cpp does not requantize. The only quants that don't degrade the performance of the original model, and the only ones that make sense to use for any GPT-OSS-based model, are the MXFP4_MOE quants. We are considering whether it even makes sense to provide anything other than MXFP4 quants for GPT-OSS-based models, as the rest seem fairly useless. In particular, any quant above 4 bits really doesn't make sense when the model was trained in 4 bits.
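To illustrate the point with a toy example (just uniform round-to-nearest in numpy, not MXFP4's actual block-scaled FP4 format and not llama.cpp's code): once the weights have been rounded to a 4-bit grid, re-encoding them at 8 bits afterwards cannot bring back what the rounding threw away, so the bigger file buys essentially nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
w_bf16 = rng.normal(size=10_000).astype(np.float32)  # stand-in for the original weights

def fake_quant(x, bits):
    """Uniform round-to-nearest with a per-tensor scale (a deliberate simplification)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return (np.round(x / scale) * scale).astype(np.float32)

w_4bit = fake_quant(w_bf16, 4)   # roughly what a 4-bit release looks like
w_8of4 = fake_quant(w_4bit, 8)   # "8-bit" re-encoding of the already-4-bit weights

print("mean |error| of 4-bit vs original:        ", np.abs(w_4bit - w_bf16).mean())
print("mean |error| of 8-bit-of-4-bit vs original:", np.abs(w_8of4 - w_bf16).mean())
print("how much the 8-bit pass moved the 4-bit values:", np.abs(w_8of4 - w_4bit).mean())
```

The error against the original is dominated by the 4-bit rounding either way; the extra bits only make the file larger.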
I highly recommend you just get the following quant: https://huggingface.co/mradermacher/gpt-oss-20b-Derestricted-i1-GGUF/blob/main/gpt-oss-20b-Derestricted.i1-MXFP4_MOE.gguf
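If it helps, here is a minimal sketch of fetching just that one file with huggingface_hub (repo and filename taken from the link above; you could equally use the web download button):

```python
from huggingface_hub import hf_hub_download

# Downloads only the MXFP4_MOE GGUF into the local HF cache and returns its path.
path = hf_hub_download(
    repo_id="mradermacher/gpt-oss-20b-Derestricted-i1-GGUF",
    filename="gpt-oss-20b-Derestricted.i1-MXFP4_MOE.gguf",
)
print(path)  # point llama.cpp's -m flag at this file
```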
But this is not the original GPT-OSS model that was post-trained and released in MXFP4; it is an abliterated version by ArliAI. The Reddit post mentioned that they converted it to BF16 before the abliteration process, and after abliteration the weights should no longer be "distributed" in a way that is suitable for MXFP4 quantization. So I believe MXFP4 might not be the optimal quant for this model, given that the brief discussion at https://github.com/ikawrakow/ik_llama.cpp/pull/682 suggests that MXFP4 is not a good quantization scheme on its own.
Still, I downloaded your linked MXFP4 version to test it alongside the Q5_K_M version. In a very quick, rough test, the MXFP4 version seems to hallucinate or mess up formatting more often.
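For reference, the kind of quick side-by-side check I mean looks roughly like this (a sketch assuming llama-cpp-python is installed; the file paths and prompts are placeholders, not a rigorous benchmark):

```python
from llama_cpp import Llama

prompts = [
    "Summarize the plot of Hamlet in three bullet points.",
    "Return a JSON object with keys 'name' and 'age' for a fictional person.",
]

# Placeholder file names; substitute the actual local paths of the two quants.
for path in ["gpt-oss-20b-Derestricted.i1-MXFP4_MOE.gguf",
             "gpt-oss-20b-Derestricted.Q5_K_M.gguf"]:
    llm = Llama(model_path=path, n_ctx=4096, verbose=False)
    print(f"=== {path} ===")
    for p in prompts:
        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": p}],
            max_tokens=256,
            temperature=0.0,  # keep sampling deterministic so runs are comparable
        )
        print(out["choices"][0]["message"]["content"], "\n")
```

It only eyeballs a handful of outputs, so take the hallucination/formatting impression with a grain of salt.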
