GLM4.5V having issues with bfloat16 / float16 training

I am trying to run SFT with QLoRA on GLM4.5V. I would like to train in either bfloat16 or float16, but I hit a dtype mismatch when actually running the model:

MODEL_ID = "zai-org/GLM-4.5V"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = Glm4vMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    quantization_config=bnb,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map={"": 0},
)

File /usr/local/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py:342, in Glm4vMoeTextMoE.forward(self, hidden_states)
    340 topk_indices, topk_weights = self.gate(hidden_states)
    341 hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
--> 342 hidden_states = self.moe(hidden_states, topk_indices, topk_weights).view(orig_shape)
    343 hidden_states = hidden_states + self.shared_experts(residuals)
    344 return hidden_states

File /usr/local/lib/python3.11/site-packages/transformers/models/glm4v_moe/modeling_glm4v_moe.py:330, in Glm4vMoeTextMoE.moe(self, hidden_states, topk_indices, topk_weights)
    328 expert_output = expert(expert_input)
    329 weighted_output = expert_output * expert_weights.unsqueeze(-1)
--> 330 final_hidden_states.index_add_(0, token_indices, weighted_output)
    332 # in original deepseek, the output of the experts are gathered once we leave this module
    333 # thus the moe module is itself an IsolatedParallel module
    334 # and all experts are "local" meaning we shard but we don't gather
    335 return final_hidden_states.type(hidden_states.dtype)

RuntimeError: index_add_(): self (Half) and source (Float) must have the same scalar type

Can someone help me get GLM4.5V working for this? I am using the preview build (transformers-v4.55.0-GLM-4.5V-preview) listed on the zai-org/GLM-4.5V · Hugging Face model page.


For now, it may be more stable to pin the model dtype and the training dtype to a single value. Also, Flash Attention 2 seems to cause bugs during fine-tuning, so fall back to eager attention.

DTYPE = torch.bfloat16 # or torch.float16

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=DTYPE,
)

model = Glm4vMoeForConditionalGeneration.from_pretrained(
    "zai-org/GLM-4.5V",
    quantization_config=nf4_config,
    torch_dtype=DTYPE, # avoid torch_dtype="auto" https://github.com/pytorch/torchtune/issues/1349
    attn_implementation="eager",  # avoid FA2 during finetune https://github.com/zai-org/GLM-V/issues/149
    use_cache=False,
)
model.to(dtype=DTYPE)
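
To carry the same "single dtype" idea through the QLoRA SFT step, here is a minimal sketch of attaching LoRA adapters and checking for stray dtypes, assuming peft is installed. The LoRA hyperparameters and the target_modules names are assumptions (placeholders), not values taken from the GLM-4.5V repo; verify the actual projection names with model.named_modules() before training.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # assumption: placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: check real module names
)
model = get_peft_model(model, lora_config)

# Diagnostic: print any floating-point parameter whose dtype differs from DTYPE,
# since a stray float32/float16 tensor is the kind of mismatch that triggers
# the index_add_ error inside the MoE block.
for name, param in model.named_parameters():
    if param.is_floating_point() and param.dtype != DTYPE:
        print(name, param.dtype)

If the diagnostic loop prints nothing (or only adapter weights you intend to keep in float32), the forward pass should no longer mix Half and Float tensors in the MoE index_add_ call.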