Broken model config?

#3
by eugreugr - opened

Getting this error: KeyError: 'layers.43.self_attn.qkv_proj.k_scale'
vLLM 0.14.0rc1.dev71+gf1c2c2013 (latest commit as of this writing), running on an NVIDIA GB10 (DGX Spark).

Same issue here.

I had the exact same error when I attempted to run inference on vLLM. There's a mismatch between how the model was quantized and how vLLM tries to load it. The NVFP4 checkpoint uses a fused QKV quantization scheme: the query, key, and value projections are treated as a single block (or are partly fused), so there is likely one global scale (or input scale) for the layer, but no separate scales for the key (k_scale) or value (v_scale) heads.
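If you want to verify what the checkpoint actually ships, listing the scale-related keys in the safetensors shards makes it obvious. A quick sketch (the local directory is hypothetical, point it at wherever you downloaded the model):

import glob
from safetensors import safe_open

# Hypothetical local checkpoint directory; adjust to your download location.
for shard in sorted(glob.glob('/models/glm4-nvfp4/*.safetensors')):
    with safe_open(shard, framework='pt', device='cpu') as f:
        for key in f.keys():
            # Print every attention-related scale so you can see exactly
            # which k_scale / v_scale entries (if any) exist on disk.
            if 'self_attn' in key and 'scale' in key:
                print(key)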

The current Glm4MoeForCausalLM loader in vLLM assumes a split-QKV scheme for NVFP4: while iterating over the checkpoint it expects to map separate k_scale and v_scale tensors onto matching model parameters. Because the fused layout registers no such parameters, the bare params_dict[name] lookup finds nothing and raises the KeyError above.
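In miniature, the failure looks like this (names simplified; this is just the shape of the bug, not vLLM's actual loader code):

# The loader resolves every checkpoint-derived name with a bare dict lookup.
params_dict = {
    'layers.43.self_attn.qkv_proj.weight_scale': object(),  # fused scale exists
}
name = 'layers.43.self_attn.qkv_proj.k_scale'  # split scale was never registered
param = params_dict[name]  # KeyError: 'layers.43.self_attn.qkv_proj.k_scale'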

What I did to fix it was modify vllm/model_executor/models/glm4_moe.py to skip the k_scale and v_scale parameters when they have no counterpart in the model's parameter dict, rather than crashing.

Here's my implementation that got it running on my dual DGX Spark setup:

import os
import re

# Path to the installed vLLM model file
path = '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py'

if os.path.exists(path):
    with open(path, 'r') as f:
        lines = f.readlines()

    target_str = 'param = params_dict[name]'
    guard = "if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue"
    new_lines = []
    patched = False

    for line in lines:
        # Inject the guard before each parameter-loading line, unless the
        # previous line is already our guard (keeps the script idempotent,
        # so re-running it won't insert duplicates).
        if target_str in line and (not new_lines or guard not in new_lines[-1]):
            whitespace = re.match(r'^(\s*)', line).group(1)

            # Inject logic: if asking for a k_scale/v_scale the model
            # doesn't have, skip it instead of crashing.
            new_lines.append(f"{whitespace}{guard}\n")
            patched = True
        new_lines.append(line)

    if patched:
        with open(path, 'w') as f:
            f.writelines(new_lines)
        print(f"Successfully patched {path}")
    else:
        print("File already patched or target not found.")

OK, thanks, I'll try it.
Here is a diff that you can feed to the patch command:

--- a/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
+++ b/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
@@ -537,6 +537,7 @@
                 if is_pp_missing_parameter(name, self):
                     continue

+                if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
                 param = params_dict[name]
                 weight_loader = param.weight_loader
                 weight_loader(param, loaded_weight, shard_id)
@@ -596,6 +597,7 @@
                     if is_pp_missing_parameter(name, self):
                         continue

+                    if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
                     param = params_dict[name]
                     weight_loader = getattr(
                         param, "weight_loader", default_weight_loader

Or, to apply it with just one command:

cat <<'EOM' | patch -p1 -d /
--- a/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
+++ b/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py
@@ -537,6 +537,7 @@
                 if is_pp_missing_parameter(name, self):
                     continue

+                if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
                 param = params_dict[name]
                 weight_loader = param.weight_loader
                 weight_loader(param, loaded_weight, shard_id)
@@ -596,6 +597,7 @@
                     if is_pp_missing_parameter(name, self):
                         continue

+                    if ('k_scale' in name or 'v_scale' in name) and name not in params_dict: continue
                     param = params_dict[name]
                     weight_loader = getattr(
                         param, "weight_loader", default_weight_loader
EOM
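
To double-check that both hunks applied, you can grep for the guard; you should see two matches, one per load path:

grep -n "name not in params_dict" /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/glm4_moe.py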

With this patch I can confirm that it's working.
Do you think this modification could negatively affect other quants? If not, you may consider submitting a pull request to vLLM.

It shouldn't affect other quants, since it just skips missing k_scale and v_scale params instead of hard-crashing on the lookup. It could mask problems if the model is actually broken, though. I'll swing by the vLLM repo and see if they already have something like this cooked up.
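
If it ever goes upstream, a less silent variant might be preferable. Here's a sketch of the same guard that warns instead of skipping quietly (assuming the module-level logger that vLLM model files typically define via init_logger):

if ('k_scale' in name or 'v_scale' in name) and name not in params_dict:
    # Warn instead of skipping silently, so a genuinely broken checkpoint
    # still leaves a trace in the logs.
    logger.warning('Skipping KV-cache scale missing from params_dict: %s', name)
    continue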

Salyut1 changed discussion status to closed
