Quantization of ai-sage/GigaChat3-10B-A1.8B-bf16

The pure Q8_0 quant runs on both on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection REQUIRE ik_llama.cpp fork to support the ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

Some of ik's new quants are supported with Nexesenex/croco.cpp fork of KoboldCPP with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here. which have been CUDA 12.8.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I really appreciate all the support from aifoundry.org so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!

Quant Collection

Perplexity computed against wiki.test.raw.

BF16 19.884 GiB (16.004 BPW)

Final estimate: PPL over 610 chunks for n_ctx=512 = 6.7281 +/- 0.04227

Not uploaded, just baseline measurement for full size unquantized model.

Q8_0 10.568 GiB (8.506 BPW)

Final estimate: PPL over 610 chunks for n_ctx=512 = 6.7287 +/- 0.04226

This will run on either ik_llama.cpp or mainline llama.cpp. Be sure to update to get PRs listed below.

👈 Secret Recipe

#!/usr/bin/env bash

./build/bin/llama-quantize \
    --pure \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-Q8_0.gguf \
    Q8_0 \
    128

IQ5_K 7.598 GiB (6.115 BPW)

Final estimate: PPL over 610 chunks for n_ctx=512 = 6.7510 +/- 0.04244

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

token_embd\.weight=iq6_k
output\.weight=iq6_k
"""

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-IQ5_K.gguf \
    IQ5_K \
    64

IQ4_KSS 5.654 GiB (4.551 BPW)

Final estimate: PPL over 610 chunks for n_ctx=512 = 6.8721 +/- 0.04330

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

token_embd\.weight=iq6_k
output\.weight=iq6_k
"""

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-IQ4_KSS.gguf \
    IQ4_KSS \
    64

IQ2_KT 3.869 GiB (3.114 BPW)

Final estimate: PPL over 610 chunks for n_ctx=512 = 7.8891 +/- 0.05058

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

token_embd\.weight=iq6_k
output\.weight=iq6_k
"""

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-IQ2_KT.gguf \
    IQ2_KT \
    64

smol-IQ1_KT 3.042 GiB (2.448 BPW)

Final estimate: PPL over 610 chunks for n_ctx=512 = 9.7675 +/- 0.06444

only for the desperate

👈 Secret Recipe

#!/usr/bin/env bash

custom="
## Attention [0-25] (GPU)
blk\..*\.attn.*\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-25] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-25] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

token_embd\.weight=iq4_k
output\.weight=iq6_k
"""

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/imatrix-GigaChat3-10B-A1.8B-BF16.dat \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-BF16.gguf \
    /mnt/data/models/ubergarm/GigaChat3-10B-A1.8B-GGUF/GigaChat3-10B-A1.8B-smol-IQ1_KT.gguf \
    IQ1_KT \
    64

Quick Start

# Example running on mainline llama.cpp CPU-only
./build/bin/llama-server \
    --model "$model"\
    --alias ubergarm/GigaChat3-10B-A1.8B-GGUF \
    --ctx-size 32768 \
    --parallel 1 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja

Tips:

for full offload onto GPU just add -ngl 99 and use one thread with --threads 1
to save space on kv-cache use -ctk q8_0 which is all you need given this is MLA
bring your own jinja chat template with --jinja --chat-template-file ./myFixedTemplate.jinja

References

Downloads last month: 3,571

GGUF

Model size

11B params

Architecture

deepseek2

Hardware compatibility

2-bit

8-bit

View +1 variant

Model tree for ubergarm/GigaChat3-10B-A1.8B-GGUF

Base model

ai-sage/GigaChat3-10B-A1.8B-bf16

Quantized

(13)

this model