---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: inclusionAI/Ling-1T
license: mit
base_model_relation: quantized
tags:
  - imatrix
  - bailing_moe
  - conversational
  - ik_llama.cpp
---

# ik_llama.cpp imatrix Quantizations of inclusionAI/Ling-1T

This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
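If you still need to build the fork, a minimal sketch of a typical CUDA build follows the usual llama.cpp-style CMake flow (the flags shown are illustrative assumptions, not the author's exact build configuration):

```bash
# Minimal sketch of building ik_llama.cpp with CUDA; adjust flags for your hardware.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j $(nproc)
```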

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
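For example, pointing the fork's llama-server at a GGUF you already have is enough to kick the tires; the invocation below is a hypothetical sketch (model path, context size, offload count, and thread count are placeholders):

```bash
# Hypothetical example: serve an existing GGUF with ik_llama.cpp's llama-server.
./build/bin/llama-server \
    --model /path/to/your/existing-model.gguf \
    --ctx-size 8192 \
    --n-gpu-layers 99 \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```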

Some of ik's new quants are supported by the Nexesenex/croco.cpp fork of KoboldCpp, which offers Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here, which have been built against CUDA 12.8.

These quants provide best-in-class perplexity for the given memory footprint.

## Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, and the YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

## Quant Collection

Perplexity computed against wiki.test.raw.

Perplexity Chart
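If you want to reproduce the perplexity numbers below, a typical llama-perplexity run against wiki.test.raw looks roughly like this (paths and thread count are placeholders; the author's exact flags and seed may differ):

```bash
# Rough sketch of a wiki.test.raw perplexity run; paths and extra flags are assumptions.
./build/bin/llama-perplexity \
    -m /path/to/Ling-1T-smol-IQ2_KS.gguf \
    -f wiki.test.raw \
    --threads 16
```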

This one is just a test quant for baseline perplexity comparison:

- Q8_0 989.678 GiB (8.504 BPW)
  - Final estimate: PPL = TODO

### smol-IQ4_KSS TODO

Final estimate: PPL = TODO

### smol-IQ2_KS TODO

Final estimate: PPL = TODO

Should hopefully fit in 249.38 GiB RAM + 14.3 GiB VRAM + kv-cache/context...🤞

Leaving the attn.*, the first 4 dense layers, and the shexp tensors at full q8_0 would take about 20.1 GiB VRAM; I might do some other quants like that for folks with more VRAM.
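One common way to hit that kind of RAM/VRAM split with ik_llama.cpp is to offload all layers to the GPU and then override the routed-expert tensors back onto CPU/system RAM. The sketch below shows that pattern under assumed settings (model path, context size, and thread count are placeholders, and -fmoe is ik_llama.cpp's fused-MoE flag); it is not a tested command:

```bash
# Assumed offload pattern: routed experts stay in system RAM, everything else
# (attention, dense layers, shared experts) goes to VRAM.
./build/bin/llama-server \
    --model /path/to/Ling-1T-smol-IQ2_KS.gguf \
    -ngl 99 \
    -ot exps=CPU \
    -fa -fmoe \
    --ctx-size 32768 \
    --threads 16
```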

<details>

<summary>👈 Secret Recipe</summary>

```bash
custom="
# 80 Repeating Layers [0-79]

# Attention
blk\..*\.attn_qkv.*=iq6_k
blk\..*\.attn_output.*=iq6_k

# First 4 Dense Layers [0-3]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [3-79]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks

# Routed Experts Layers [3-79]
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

# Strip the comment lines and collapse the recipe into a single comma-separated string
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Quantize with the per-tensor recipe; IQ2_KS is the fallback type for unmatched tensors
# and the trailing 192 is the thread count.
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Ling-1T-GGUF/imatrix-Ling-1T-Q8_0.dat \
    /mnt/data/models/ubergarm/Ling-1T-GGUF/Ling-1T-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Ling-1T-GGUF/Ling-1T-smol-IQ2_KS.gguf \
    IQ2_KS \
    192
```

</details>
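The --imatrix file referenced above was computed separately. If you want to build your own importance matrix for a different calibration set, the usual llama-imatrix flow looks roughly like the sketch below (the model path and calibration corpus are assumptions, not the author's actual procedure):

```bash
# Rough sketch of generating an importance matrix; model path and calibration data are placeholders.
./build/bin/llama-imatrix \
    -m /path/to/Ling-1T-Q8_0.gguf \
    -f calibration_data.txt \
    -o imatrix-Ling-1T-Q8_0.dat
```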

## Quick Start

```bash
echo TODO
```

## References