Broken model config?
Getting this error: KeyError: 'layers.43.self_attn.qkv_proj.k_scale'
vLLM 0.14.0rc1.dev71+gf1c2c2013 (latest commit at this moment), running on NVIDIA GB10 (DGX Spark).
same issue
I had the exact same error when I tried to run inference on vLLM. There's a mismatch between how the model was quantized and how vLLM tries to load it. The NVFP4 checkpoint uses a fused QKV quantization scheme: the Query, Key, and Value projections are treated as a single block (or partly fused), so there is likely one shared scale (or input scale) for the fused projection rather than specific, separate scales for the Key (k_scale) and Value (v_scale) heads.
The current Glm4MoeForCausalLM loader in vLLM, on the other hand, assumes a split-QKV scheme for NVFP4 and expects separate k_scale and v_scale entries that it can hand to the attention backend. Because the scale names don't line up, the param = params_dict[name] lookup has nothing to find and crashes with the KeyError.
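Before patching anything, you can check which scale tensors the checkpoint itself ships for that layer: the sharded safetensors index lists every tensor name. A minimal sketch, assuming a local snapshot with the usual model.safetensors.index.json layout (the path is hypothetical, adjust to your setup):

import json

# Hypothetical path to the local NVFP4 checkpoint snapshot; change as needed.
index_path = "/models/glm4-moe-nvfp4/model.safetensors.index.json"

with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]  # tensor name -> shard file

# List every scale-related tensor under layer 43's attention block.
scale_keys = sorted(k for k in weight_map
                    if "layers.43.self_attn" in k and "scale" in k)
print("\n".join(scale_keys))

Comparing that listing with the name in the KeyError shows exactly where the naming mismatch is.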
What I did to fix it was modify vllm/model_executor/models/glm4_moe.py to skip k_scale and v_scale entries when there is no matching parameter to load them into, rather than crashing.
Here's my implementation that got it running on my dual DGX Spark setup:
import os
import re

# Path to the installed vLLM model file
path = '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py'

# The guard to inject just before each parameter lookup in load_weights()
guard = "if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue"
target_str = 'param = params_dict[name]'

if os.path.exists(path):
    with open(path, 'r') as f:
        lines = f.readlines()

    already_patched = any(guard in line for line in lines)
    new_lines = []
    patched = False

    for line in lines:
        # Look for the parameter-loading line and inject the skip right above it
        if target_str in line and not already_patched:
            whitespace = re.match(r'^(\s*)', line).group(1)
            # If there is no matching k_scale/v_scale parameter, skip instead of crashing
            new_lines.append(f"{whitespace}{guard}\n")
            new_lines.append(line)
            patched = True
        else:
            new_lines.append(line)

    if patched:
        with open(path, 'w') as f:
            f.writelines(new_lines)
        print(f"Successfully patched {path}")
    else:
        print("File already patched or target not found.")
OK, thanks, I'll try it.
Here is a diff you can feed to the patch command:
--- a/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
+++ b/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
@@ -537,6 +537,7 @@
if is_pp_missing_parameter(name, self):
continue
+ if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
param = params_dict[name]
weight_loader = param.weight_loader
weight_loader(param, loaded_weight, shard_id)
@@ -596,6 +597,7 @@
if is_pp_missing_parameter(name, self):
continue
+ if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
param = params_dict[name]
weight_loader = getattr(
param, "weight_loader", default_weight_loader
Or, to apply it with a single command:
cat <<'EOM' | patch -p1 -d /
--- a/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
+++ b/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
@@ -537,6 +537,7 @@
if is_pp_missing_parameter(name, self):
continue
+ if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
param = params_dict[name]
weight_loader = param.weight_loader
weight_loader(param, loaded_weight, shard_id)
@@ -596,6 +597,7 @@
if is_pp_missing_parameter(name, self):
continue
+ if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
param = params_dict[name]
weight_loader = getattr(
param, "weight_loader", default_weight_loader
EOM
With this patch I can confirm that it's working.
Do you think this modification could negatively affect other quants? If not, you may consider submitting a pull request to vLLM.
It shouldn't affect other quants, since it only skips k_scale and v_scale entries that have nowhere to load into, instead of hard-crashing on the missing key. It could hide problems if a checkpoint is genuinely broken, though. I'll swing by the vLLM repo and see if they already have something like this cooked up.
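If it does get upstreamed, a slightly more defensive version of the skip could log what is being dropped, so a genuinely broken checkpoint doesn't fail silently. A rough sketch of that idea (the helper below is made up, not existing vLLM code):

import logging

logger = logging.getLogger(__name__)

# Hypothetical helper: same skip as the patch above, but each dropped entry
# is logged so a truly broken checkpoint stays visible in the startup logs.
def skip_missing_kv_scale(name: str, params_dict: dict) -> bool:
    if ("k_scale" in name or "v_scale" in name) and name not in params_dict:
        logger.warning("No parameter for %s; skipping this scale entry.", name)
        return True
    return False

# Inside load_weights() it would replace the injected one-liner:
#     if skip_missing_kv_scale(name, params_dict):
#         continue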