GLM-4.7-GGUF

I am currently looking for open positions! 🤗 If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.

Description

This repository contains GGUF format model files for Zhipu AI's GLM-4.7.

GLM-4.7 is a powerful open-weights model designed for complex reasoning, agentic coding, and tool use. It supports "Thinking" (Chain of Thought) natively within its chat template.

Benchmark performance. A detailed comparison of GLM-4.7 with GPT-5-High, GPT-5.1-High, Claude Sonnet 4.5, Gemini 3.0 Pro, DeepSeek-V3.2, and Kimi K2 Thinking across 17 benchmarks (including 8 reasoning, 5 coding, and 3 agentic benchmarks) is shown in the table below.

| Benchmark | GLM-4.7 | GLM-4.6 | Kimi K2 Thinking | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5-High | GPT-5.1-High |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 84.3 | 83.2 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 | 87.0 |
| GPQA-Diamond | 85.7 | 81.0 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 | 88.1 |
| HLE | 24.8 | 17.2 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 | 25.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | 44.9 | 40.8 | 45.8 | 32.0 | 35.2 | 42.7 |
| AIME 2025 | 95.7 | 93.9 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 | 94.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 | 96.3 |
| HMMT Nov. 2025 | 93.5 | 87.7 | 89.2 | 90.2 | 93.3 | 81.7 | 89.2 | - |
| IMOAnswerBench | 82.0 | 73.5 | 78.6 | 78.3 | 83.3 | 65.8 | 76.0 | - |
| LiveCodeBench-v6 | 84.9 | 82.8 | 83.1 | 83.3 | 90.7 | 64.0 | 87.0 | 87.0 |
| SWE-bench Verified | 73.8 | 68.0 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 | 76.3 |
| SWE-bench Multilingual | 66.7 | 53.8 | 61.1 | 70.2 | - | 68.0 | 55.3 | - |
| Terminal Bench Hard | 33.3 | 23.6 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 | 43.0 |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 | 47.6 |
| BrowseComp | 52.0 | 45.1 | - | 51.4 | - | 24.1 | 54.9 | 50.8 |
| BrowseComp (w/ Context Manage) | 67.5 | 57.5 | 60.2 | 67.6 | 59.2 | - | - | - |
| BrowseComp-Zh | 66.6 | 49.5 | 62.3 | 65.0 | - | 42.4 | 63.0 | - |
| τ²-Bench | 87.4 | 75.2 | 74.3 | 85.3 | 90.7 | 87.2 | 82.4 | 82.7 |

How to Run (llama.cpp)

Important: This model uses "Thinking" (Chain of Thought), which consumes significant context. You must increase the generation limit (-n) and specify stop tokens to prevent infinite loops.
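
If you have not downloaded a quantization yet, one way to fetch the Q4_K_M file used in the examples below is via huggingface-cli (a sketch assuming the tool is installed and that the file name matches the one used throughout this card):

huggingface-cli download AaryanK/GLM-4.7-GGUF GLM-4.7.Q4_K_M.gguf --local-dir .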

1. CLI Inference (Interactive Chat)

# -n 2048  : allow enough tokens for the "Thinking" phase
# -c 8192  : adjust context size based on available VRAM
# --temp / --top-p : sampling settings recommended for reasoning
# -ngl 99  : offload all layers to GPU (reduce if you run out of memory)
# -r ...   : CRITICAL stop strings that prevent infinite generation loops
# -cnv     : enable conversation (chat) mode
./llama-cli -m GLM-4.7.Q4_K_M.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.7 \
  --top-p 0.9 \
  -ngl 99 \
  -r "<|user|>" \
  -r "<|observation|>" \
  -cnv \
  -p "Hello"

Note: If you want to see the internal "Thinking" process (the text between <think> tags), add the --special flag to the command.
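
If you capture raw output that still contains the reasoning block (for example when running with --special and redirecting to a file), a minimal post-processing sketch, assuming the reasoning is wrapped in literal <think>...</think> tags and was saved to a hypothetical output.txt:

perl -0777 -pe 's/<think>.*?<\/think>\s*//gs' output.txt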

2. Server Mode (API)

Running a persistent server is recommended for a model of this size, as it avoids reloading the weights for every session.

./llama-server -m GLM-4.7.Q4_K_M.gguf \
  --port 8080 \
  -ngl 99 \
  -c 8192 \
  -n 2048 \
  --alias glm4
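
Once the server is up, you can verify it is reachable before sending requests (a sketch assuming llama-server's /health endpoint and the port chosen above):

curl http://localhost:8080/health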

API Request Example (JSON):

When using the API, ensure you include the stop tokens in your payload:

{
  "model": "glm4",
  "messages": [
    { "role": "user", "content": "Explain quantum computing." }
  ],
  "stop": ["<|user|>", "<|observation|>"],
  "max_tokens": 2048
}
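
For example, the payload above can be sent with curl (a sketch assuming the server started earlier and its OpenAI-compatible /v1/chat/completions route on port 8080):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm4",
    "messages": [
      { "role": "user", "content": "Explain quantum computing." }
    ],
    "stop": ["<|user|>", "<|observation|>"],
    "max_tokens": 2048
  }'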

Hardware Requirements

  • Full GPU Offloading (-ngl 99): Requires ~130GB VRAM for Q4_K_M (e.g., 2x A100 80GB or Mac Studio Ultra).

  • Split Offloading: For a single A100 (80GB), use a smaller quantization such as Q2_K or IQ2_XXS and set -ngl 40 (adjust based on available VRAM) to split the model between GPU and system RAM, as sketched below.
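
A minimal sketch of such a split-offload run, assuming a hypothetical GLM-4.7.Q2_K.gguf file name (use whichever quantization you actually downloaded) and the same stop strings as above:

./llama-cli -m GLM-4.7.Q2_K.gguf \
  -ngl 40 \
  -c 8192 \
  -n 2048 \
  -r "<|user|>" \
  -r "<|observation|>" \
  -cnv \
  -p "Hello"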

Default Settings (Most Tasks)

  • temperature: 1.0

  • top-p: 0.95

  • max new tokens: 131072

For multi-turn agentic tasks (τ²-Bench and Terminal Bench 2), please turn on Preserved Thinking mode.

CLI Example

# -e      : process \n escape sequences in the prompt string
# -no-cnv : raw completion mode, since the chat template is written out manually below
./llama-cli -m GLM-4.7.Q4_K_M.gguf \
  -c 8192 \
  -n 2048 \
  --temp 1.0 \
  --top-p 0.95 \
  -e \
  -no-cnv \
  -p "[gMASK]<sop><|system|>\nYou are a helpful assistant.<|user|>\nWrite a Python script to calculate Fibonacci numbers.<|assistant|>\n<think>"
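
The same raw-template prompt can also be sent to a running llama-server (a sketch assuming its native /completion endpoint and the port used earlier; the sampling values follow the default settings above):

curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "[gMASK]<sop><|system|>\nYou are a helpful assistant.<|user|>\nWrite a Python script to calculate Fibonacci numbers.<|assistant|>\n<think>",
    "n_predict": 2048,
    "temperature": 1.0,
    "top_p": 0.95,
    "stop": ["<|user|>", "<|observation|>"]
  }'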

Model Details

  • Model size: 358B params

  • Architecture: glm4moe

  • Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit

Model tree for AaryanK/GLM-4.7-GGUF

  • Base model: zai-org/GLM-4.7 (this repository is a quantized version of it)