GLM-4.7-GGUF
I am currently looking for open positions! If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.
Description
This repository contains GGUF format model files for Zhipu AI's GLM-4.7.
GLM-4.7 is a powerful open-weights model designed for complex reasoning, agentic coding, and tool use. It supports "Thinking" (Chain of Thought) natively within its chat template.
Benchmark Performance
The table below compares GLM-4.7 with GPT-5-High, GPT-5.1-High, Claude Sonnet 4.5, Gemini 3.0 Pro, DeepSeek-V3.2, and Kimi K2 Thinking across 17 benchmarks (including 8 reasoning, 5 coding, and 3 agentic benchmarks).
| Benchmark | GLM-4.7 | GLM-4.6 | Kimi K2 Thinking | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5-High | GPT-5.1-High |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 84.3 | 83.2 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 | 87.0 |
| GPQA-Diamond | 85.7 | 81.0 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 | 88.1 |
| HLE | 24.8 | 17.2 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 | 25.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | 44.9 | 40.8 | 45.8 | 32.0 | 35.2 | 42.7 |
| AIME 2025 | 95.7 | 93.9 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 | 94.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 | 96.3 |
| HMMT Nov. 2025 | 93.5 | 87.7 | 89.2 | 90.2 | 93.3 | 81.7 | 89.2 | - |
| IMOAnswerBench | 82.0 | 73.5 | 78.6 | 78.3 | 83.3 | 65.8 | 76.0 | - |
| LiveCodeBench-v6 | 84.9 | 82.8 | 83.1 | 83.3 | 90.7 | 64.0 | 87.0 | 87.0 |
| SWE-bench Verified | 73.8 | 68.0 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 | 76.3 |
| SWE-bench Multilingual | 66.7 | 53.8 | 61.1 | 70.2 | - | 68.0 | 55.3 | - |
| Terminal Bench Hard | 33.3 | 23.6 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 | 43.0 |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 | 47.6 |
| BrowseComp | 52.0 | 45.1 | - | 51.4 | - | 24.1 | 54.9 | 50.8 |
| BrowseComp (w/ Context Manage) | 67.5 | 57.5 | 60.2 | 67.6 | 59.2 | - | - | - |
| BrowseComp-Zh | 66.6 | 49.5 | 62.3 | 65.0 | - | 42.4 | 63.0 | - |
| τ²-Bench | 87.4 | 75.2 | 74.3 | 85.3 | 90.7 | 87.2 | 82.4 | 82.7 |
How to Run (llama.cpp)
Important: This model uses "Thinking" (Chain of Thought), which consumes significant context. You must increase the generation limit (-n) and specify stop tokens to prevent infinite loops.
1. CLI Inference (Interactive Chat)
```bash
# -n 2048 : allow enough tokens for "Thinking"
# -c 8192 : adjust context to fit your VRAM
# -ngl 99 : offload layers to the GPU (reduce if you hit OOM)
# --temp 0.7 : recommended for reasoning; -cnv enables conversation mode
# -r ...  : CRITICAL: stop tokens prevent infinite generation loops
./llama-cli -m GLM-4.7.Q4_K_M.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.7 \
  --top-p 0.9 \
  -ngl 99 \
  -r "<|user|>" \
  -r "<|observation|>" \
  -cnv \
  -p "Hello"
```
Note: If you want to see the internal "Thinking" process (the text between `<think>` tags), add the `--special` flag to the command.
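If you save a transcript and want to share only the final answers, you can drop the reasoning block afterwards. A minimal sketch, assuming the output was captured to a file named `output.txt` (the file name is illustrative) and the thinking block sits between `<think>` and `</think>` tags on their own lines:

```bash
# Delete every line from an opening <think> tag through the matching </think> tag (illustrative post-processing)
sed '/<think>/,/<\/think>/d' output.txt > answer_only.txt
```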
2. Server Mode (API)
Running a persistent server is recommended for a model of this size, since it avoids reloading the weights for every request.
```bash
./llama-server -m GLM-4.7.Q4_K_M.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  -n 2048 \
  --alias glm4
```
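Before sending requests, you can check that the server has finished loading the model. A minimal sketch, assuming llama-server's default health endpoint on the port chosen above:

```bash
# Should report an OK status once the model is loaded and the server is ready to accept requests
curl http://localhost:8080/health
```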
API Request Example (JSON):
When using the API, ensure you include the stop tokens in your payload:
```json
{
  "model": "glm4",
  "messages": [
    { "role": "user", "content": "Explain quantum computing." }
  ],
  "stop": ["<|user|>", "<|observation|>"],
  "max_tokens": 2048
}
```
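As a concrete way to send this payload, here is a sketch using curl against llama-server's OpenAI-compatible chat endpoint, assuming the port and alias from the server command above:

```bash
# POST the chat request to the local server; the stop tokens in the payload prevent runaway generation
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm4",
    "messages": [
      { "role": "user", "content": "Explain quantum computing." }
    ],
    "stop": ["<|user|>", "<|observation|>"],
    "max_tokens": 2048
  }'
```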
Hardware Requirements
- Full GPU Offloading (`-ngl 99`): Requires ~130GB of VRAM for Q4_K_M (e.g., 2x A100 80GB or a Mac Studio Ultra).
- Split Offloading: For a single A100 (80GB), use Q2_K or IQ2_XXS and set `-ngl 40` (adjust based on available VRAM) to split the model between GPU and system RAM.

Default Settings (Most Tasks)
- temperature: 1.0
- top-p: 0.95
- max new tokens: 131072
For multi-turn agentic tasks (τ²-Bench and Terminal Bench 2), please turn on Preserved Thinking mode.
CLI Example
```bash
./llama-cli -m GLM-4.7.Q4_K_M.gguf \
  -c 8192 \
  --temp 1.0 \
  --top-p 0.95 \
  -p "[gMASK]<sop><|system|>\nYou are a helpful assistant.<|user|>\nWrite a Python script to calculate Fibonacci numbers.<|assistant|>\n<think>" \
  -cnv
```
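If the full model does not fit in VRAM, a split-offload variant of the same command might look like the sketch below. The quantization file name and layer count are illustrative assumptions; tune `-ngl` to your available VRAM as described under Hardware Requirements above.

```bash
# Illustrative split-offload run for a single 80GB GPU: offload ~40 layers, keep the rest in system RAM
./llama-cli -m GLM-4.7.IQ2_XXS.gguf \
  -ngl 40 \
  -c 8192 \
  --temp 1.0 \
  --top-p 0.95 \
  -r "<|user|>" \
  -r "<|observation|>" \
  -cnv
```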
Available Quantizations
2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
Model tree for AaryanK/GLM-4.7-GGUF
Base model: zai-org/GLM-4.7