Qwen3-VL-8B-Instruct-NVFP4

Intro

  • Original model: Qwen/Qwen3-VL-8B-Instruct
  • Quantization: NVFP4 using LLMCompressor
  • Precision: W4A4 (4-bit weights and 4-bit activations)
  • GPU support: Blackwell GPUs only (e.g., RTX 50xx series)

This model is fully compatible with vLLM and optimized to run on a single GPU with 16 GB of VRAM, achieving roughly 1.5× the throughput of the INT4 version, with potentially better accuracy.


Calibration Data

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192
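
For context, here is a minimal sketch of how these constants could be turned into a tokenized calibration set. The "LLM" subset name and the "messages" column are assumptions drawn from LLMCompressor's published calibration examples, not from this repository's quantize.py:

# Hedged sketch: build a tokenized calibration set from the constants above.
# The "LLM" subset name and "messages" column are assumptions based on
# LLMCompressor's published calibration examples; adjust to the real schema.
from datasets import load_dataset
from transformers import AutoProcessor

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

ds = load_dataset(DATASET_ID, "LLM",
                  split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def tokenize(sample):
    # Render the chat-format sample to text, then tokenize with truncation
    # so no sequence exceeds MAX_SEQUENCE_LENGTH.
    text = processor.apply_chat_template(sample["messages"], tokenize=False)
    return processor(
        text=text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)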

Tested on:

  • CUDA: 13.0
  • vLLM: 0.11.0
  • torch: 2.8.0
  • flashinfer-cubin: 0.5.1
  • flashinfer-jit-cache: 0.5.1+cu130
  • flashinfer-python: 0.5.1
  • transformers: 4.57.1

vLLM serving command

VLLM_SLEEP_WHEN_IDLE=1 vllm serve lhoang8500/Qwen3-VL-8B-Instruct-NVFP4 --max-model-len 32768 -tp 1 --limit_mm_per_prompt '{"image":1, "video":0}' --kv-cache-dtype fp8 --gpu-memory-utilization 0.9 --max-num-seqs 64 --tool-call-parser hermes --enable-auto-tool-choice
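
With the server running, a quick smoke test through vLLM's OpenAI-compatible API might look like the following; the port and image URL are placeholders:

# Hedged sketch: one image + text request against the running server.
# The base URL and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lhoang8500/Qwen3-VL-8B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)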

Generation hyperparameters

greedy=false
seed=3407
top_p=0.8
top_k=20
temperature=0.7
repetition_penalty=1.0
presence_penalty=1.5
out_seq_length=32768
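
These settings map onto vLLM's OpenAI-compatible API as sketched below; top_k and repetition_penalty are not part of the standard OpenAI schema, so they are passed through extra_body (a vLLM extension):

# Hedged sketch: apply the sampling parameters above to a chat request.
# top_k and repetition_penalty go through extra_body, a vLLM extension.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lhoang8500/Qwen3-VL-8B-Instruct-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    temperature=0.7,        # greedy=false, so sampling is enabled
    top_p=0.8,
    presence_penalty=1.5,
    seed=3407,
    max_tokens=1024,        # out_seq_length allows up to 32768
    extra_body={"top_k": 20, "repetition_penalty": 1.0},
)
print(response.choices[0].message.content)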

Quantization code

See quantize.py in this repository; credit to LLMCompressor.
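
quantize.py is the authoritative recipe. For orientation only, here is a rough, hedged sketch of what an NVFP4 one-shot run with LLMCompressor typically looks like; the ignore patterns, the model class, and the dataset handling are assumptions, not a copy of quantize.py:

# Hedged sketch of an NVFP4 one-shot run with LLMCompressor.
# NOT the repository's quantize.py: the ignore patterns, model class,
# and dataset handling below are assumptions for a Qwen3-VL checkpoint.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
SAVE_DIR = "Qwen3-VL-8B-Instruct-NVFP4"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 quantizes Linear weights and activations to 4-bit FP (W4A4).
# lm_head and the vision tower are left unquantized (assumed pattern).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:visual.*"],
)

# Calibration set matching the constants above; the "LLM" subset name
# is taken from LLMCompressor's published examples and may differ here.
ds = load_dataset("neuralmagic/calibration", "LLM", split="train[:256]")

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=256,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)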
