Quantization of ai-sage/GigaChat3-10B-A1.8B-bf16
The pure Q8_0 quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection REQUIRE the ik_llama.cpp fork, which supports ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCPP, which has Windows builds for CUDA 12.9. Also check the Windows builds by Thireus, which have been built for CUDA 12.8.
These quants provide best-in-class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Finally, I really appreciate all the support from aifoundry.org so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!
Quant Collection
Perplexity computed against wiki.test.raw.
BF16 19.884 GiB (16.004 BPW)
Final estimate: PPL over 610 chunks for n_ctx=512 = 6.7281 +/- 0.04227
Not uploaded, just baseline measurement for full size unquantized model.
Q8_0 10.568 GiB (8.506 BPW)
Final estimate: PPL over 610 chunks for n_ctx=512 = 6.7287 +/- 0.04226
This will run on either ik_llama.cpp or mainline llama.cpp. Be sure to update to get PRs listed below.
Secret Recipe
#!/usr/bin/env bash
./build/bin/llama-quantize \
--pure \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-Q8_0.gguf \
Q8_0 \
128
IQ5_K 7.598 GiB (6.115 BPW)
Final estimate: PPL over 610 chunks for n_ctx=512 = 6.7510 +/- 0.04244
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-IQ5_K.gguf \
IQ5_K \
64
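The grep and sed steps in these recipes turn the commented, multi-line rule list into the single comma-separated string passed to --custom-q. A minimal sketch on a shortened two-rule list; the variable names rules and flat are illustrative only, and sed -z assumes GNU sed:

```shell
#!/usr/bin/env bash
# Shortened rule list in the same "regex=type" format as the recipes.
rules="
## Attention
blk\..*\.attn.*\.weight=q8_0
## Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
"

# 1. grep -v '^#' drops the comment lines.
# 2. sed -z reads the whole input as one record, so runs of newlines
#    can be replaced by commas; the last two substitutions trim the
#    leading and trailing comma left by the blank first and last lines.
flat=$(
  echo "$rules" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$flat"
```

Printing the flattened string like this before calling llama-quantize is a cheap way to confirm that every rule survived the filtering.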
IQ4_KSS 5.654 GiB (4.551 BPW)
Final estimate: PPL over 610 chunks for n_ctx=512 = 6.8721 +/- 0.04330
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-IQ4_KSS.gguf \
IQ4_KSS \
64
IQ2_KT 3.869 GiB (3.114 BPW)
Final estimate: PPL over 610 chunks for n_ctx=512 = 7.8891 +/- 0.05058
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-IQ2_KT.gguf \
IQ2_KT \
64
smol-IQ1_KT 3.042 GiB (2.448 BPW)
Final estimate: PPL over 610 chunks for n_ctx=512 = 9.7675 +/- 0.06444
only for the desperate
Secret Recipe
#!/usr/bin/env bash
custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0
## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
/mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-smol-IQ1_KT.gguf \
IQ1_KT \
64
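To compare the quants at a glance, the sizes and perplexities listed above can be set against the BF16 baseline. The following is a rough sketch using awk. The percentage column is simply (PPL / BF16 PPL - 1) * 100, computed here for illustration; it is not one of the original measurements, and the variable names are made up for the example:

```shell
#!/usr/bin/env bash
# BF16 baseline perplexity, copied from the listing above.
base_ppl=6.7281

# Columns: quant name, size in GiB, perplexity (all copied from above).
report=$(awk -v base="$base_ppl" '
  {
    # Relative perplexity increase over the unquantized baseline.
    printf "%-12s %7.3f GiB  PPL %.4f  (+%.2f%% vs BF16)\n", $1, $2, $3, ($3 / base - 1) * 100
  }
' <<'EOF'
Q8_0 10.568 6.7287
IQ5_K 7.598 6.7510
IQ4_KSS 5.654 6.8721
IQ2_KT 3.869 7.8891
smol-IQ1_KT 3.042 9.7675
EOF
)

echo "$report"
```

The perplexity differences between the largest quants are smaller than the reported +/- error bars, so treat small gaps in this column as noise.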
Quick Start
# Example running on mainline llama.cpp CPU-only
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/GigaChat3-10B-A1.8B-GGUF \
--ctx-size 32768 \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap \
--jinja
Tips:
- for full offload onto GPU just add -ngl 99 and use one thread with --threads 1
- to save space on kv-cache use -ctk q8_0 which is all you need given this is MLA
- bring your own jinja chat template with --jinja --chat-template-file ./myFixedTemplate.jinja
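Putting the first two tips together, a full-GPU-offload launch might look like the sketch below. It only reuses flags that appear on this page. The model path is a placeholder, and the args array is just an illustrative way to keep the options readable. The script prints the command instead of running it:

```shell
#!/usr/bin/env bash
# Placeholder path: point this at whichever quant you downloaded.
model="${model:-/path/to/GigaChat3-10B-A1.8B-Q8_0.gguf}"

args=(
  --model "$model"
  --alias ubergarm/GigaChat3-10B-A1.8B-GGUF
  --ctx-size 32768
  -ngl 99          # offload all layers to the GPU
  --threads 1      # one CPU thread is enough when fully offloaded
  -ctk q8_0        # quantize the kv-cache to save space
  --host 127.0.0.1
  --port 8080
  --jinja
)

# Print the assembled command, shell-quoted, for inspection.
printf '%q ' ./build/bin/llama-server "${args[@]}"; echo

# To actually launch the server, run:
#   ./build/bin/llama-server "${args[@]}"
```

Keeping the options in an array avoids the backslash line-continuation pitfalls of a long multi-line command.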
References