GigaChat3-10B-A1.8B GGUF [EXPERIMENTAL]
⚠️ UNSTABLE BUILD - This is an experimental GGUF conversion with known quality issues. Use for testing only.
UPDATE: Currently does not work with llama.cpp release b7127 or higher; use an earlier release.
What is this?
Experimental GGUF conversion of GigaChat3-10B-A1.8B, a Russian dialogue model with a MoE + MLA (Multi-head Latent Attention) architecture.
Model specs:
- 10B parameters (1.8B active)
- 64 experts, 4 active per token
- 262k context window
- BF16 → GGUF conversion
⚠️ Known Issues
This conversion has degraded quality compared to the original model due to an architectural incompatibility:
- Hybrid MLA problem: GigaChat3 uses a standard, uncompressed Q-projection together with a compressed KV cache, a hybrid that llama.cpp does not support natively
- RoPE mismatch: position embeddings are applied in the wrong dimensional space
- Symptoms: incoherent long-form generation, context confusion, occasional nonsense
Why it still loads: the missing MLA components are emulated with identity matrices, which satisfies llama.cpp's loader but breaks the positional logic (see the sketch below).
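To make the "why it still loads" point concrete, here is a minimal NumPy sketch (illustrative names and shapes, not the actual conversion code) showing that the identity emulation leaves the Q projection mathematically unchanged:

```python
import numpy as np

hidden_size = 1536   # GigaChat3 hidden size
q_out_dim = 6144     # Q projection output dimension

rng = np.random.default_rng(0)
x = rng.standard_normal((1, hidden_size)).astype(np.float32)            # one token's hidden state
q_proj = rng.standard_normal((hidden_size, q_out_dim)).astype(np.float32)

# Original GigaChat3 path: direct, uncompressed Q projection
q_direct = x @ q_proj

# Emulated DeepSeek-style path: identity "compression" + unit norm scale
q_a_proj = np.eye(hidden_size, dtype=np.float32)    # stands in for the missing q_a_proj
q_a_norm = np.ones(hidden_size, dtype=np.float32)   # stands in for the missing q_a_norm
q_b_proj = q_proj                                   # original q_proj reused as q_b_proj

q_emulated = ((x @ q_a_proj) * q_a_norm) @ q_b_proj

assert np.allclose(q_direct, q_emulated, atol=1e-4)  # the projection itself is unchanged
# ...but llama.cpp then applies RoPE inside this emulated path under MLA
# assumptions the original architecture never had, which is the "RoPE mismatch" above.
```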
When to use this
✅ Good for:
- Short prompts (1-3 turns)
- Fact retrieval / memorized knowledge
- Testing GGUF tooling compatibility
- Placeholder until proper support arrives
❌ Bad for:
- Production use
- Long conversations
- Complex reasoning tasks
- Anything requiring positional awareness
Conversion method
```bash
# 1. Restructure weights to emulate MLA
#    Original:  Q = X @ q_proj                                  # q_proj: [6144, 1536]
#    Emulated:  Q = ((X @ Identity[1536, 1536]) * ones) @ q_proj
# 2. Convert with q_lora_rank = 1536
python prepare_weights.py            # creates fake q_a_proj, q_a_norm, q_b_proj
python convert_hf_to_gguf.py ./model-fixed --outfile model.gguf
```
Math is preserved, but RoPE positioning is broken.
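For context, a hypothetical sketch of what the restructuring step could look like, assuming a PyTorch state dict with DeepSeek-V2-style attention tensor names (q_a_proj, q_a_layernorm, q_b_proj); the actual prepare_weights.py may differ:

```python
import torch

def emulate_q_compression(state_dict: dict, num_layers: int, hidden_size: int = 1536) -> dict:
    """Replace each layer's q_proj with a fake q_a_proj / q_a_layernorm / q_b_proj triple."""
    out = dict(state_dict)
    for i in range(num_layers):
        prefix = f"model.layers.{i}.self_attn"
        q_w = out.pop(f"{prefix}.q_proj.weight")  # [6144, 1536] in the original checkpoint
        # Identity "compression": the compressed representation equals the input
        out[f"{prefix}.q_a_proj.weight"] = torch.eye(hidden_size, dtype=q_w.dtype)
        # Unit norm scale, matching the `* ones` term in the comments above
        out[f"{prefix}.q_a_layernorm.weight"] = torch.ones(hidden_size, dtype=q_w.dtype)
        # The original projection becomes q_b_proj, so the end-to-end math is preserved
        out[f"{prefix}.q_b_proj.weight"] = q_w
    return out
```

The model config also needs q_lora_rank set to 1536 (= hidden_size) so that convert_hf_to_gguf.py takes the compressed-Q code path.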
Usage
```bash
# llama.cpp
./llama-cli -m model.gguf \
    --temp 0.3 --top-p 0.9 -n 512 \
    -p "User: [query]\nAssistant:"
```
Recommended params:
- temperature: 0.0-0.5
- top_p: 0.8-0.9
- max_tokens: < 512 (quality degrades beyond that)
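If you use the Python bindings instead of llama-cli, the same settings look roughly like this (assumes the llama-cpp-python package built against a pre-b7127 llama.cpp; the GGUF path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)  # keep contexts short; long contexts are where quality drops

out = llm(
    "User: [query]\nAssistant:",
    temperature=0.3,   # recommended: 0.0-0.5
    top_p=0.9,         # recommended: 0.8-0.9
    max_tokens=512,    # quality degrades beyond ~512 tokens
    stop=["User:"],
)
print(out["choices"][0]["text"])
```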
Better alternatives
For production quality, use the original model with:
- vLLM (native FP8 support, proper inference)
- transformers (HF native, slower but correct; minimal example below)
- SGLang (fast + correct)
Or wait for proper llama.cpp support (requires a C++ patch).
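For reference, a minimal transformers snippet for the original checkpoint (whether trust_remote_code is needed and whether the tokenizer ships a chat template depends on the upstream repo):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai-sage/GigaChat3-10B-A1.8B-bf16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the original checkpoint is BF16
    device_map="auto",
    trust_remote_code=True,      # may not be needed if the architecture is natively supported
)

messages = [{"role": "user", "content": "Hello! Who are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```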
Technical details
Problem: llama.cpp's DeepSeek implementation assumes Q vectors are compressed (q_lora_rank < hidden_size); GigaChat3 skips Q-compression.
Hack: set q_lora_rank = hidden_size (1536) and inject identity matrices to fake the compression.
Result: the loader accepts it, but RoPE is applied to the wrong intermediate representation → broken positional encoding → quality loss.
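One way to confirm the hack in the converted file is to dump the attention metadata with the gguf Python package; the deepseek2.* key names below follow llama.cpp's DeepSeek-V2 conventions and are an assumption about this particular conversion:

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("model.gguf")
# Expectation for this conversion: deepseek2.attention.q_lora_rank == 1536
# (== hidden_size, i.e. a "compression" rank that compresses nothing), while
# kv_lora_rank keeps the model's real KV compression rank.
for name, field in reader.fields.items():
    if ".attention." in name:
        print(name, field.parts[field.data[-1]])
```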
Future
If you're a llama.cpp dev: the fix is adding a branch for q_lora_rank == null in the DeepSeek V2/V3 attention code (~100 LOC). Happy to help test!
License
MIT (inherited from base model)
Base model: ai-sage/GigaChat3-10B-A1.8B-bf16