Qwen3-VL-8B-Instruct-NVFP4

Intro

  • Original model: Qwen/Qwen3-VL-8B-Instruct
  • Quantization: NVFP4 using LLMCompressor
  • Precision: W4A4 (4-bit weights and 4-bit activations)
  • GPU support: Blackwell GPUs only (e.g., RTX 50xx series)

This model is fully compatible with vLLM and optimized to run on a single GPU with 16 GB of VRAM, achieving roughly 1.5× the throughput of the INT4 version, with potentially better accuracy.


Calibration Data

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192
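
For context, here is a minimal sketch of how these constants could be turned into a tokenized calibration set. The "LLM" subset name and the "messages" column are assumptions drawn from LLMCompressor's published calibration examples, not from this repository's quantize.py:

# Hedged sketch: build a tokenized calibration set from the constants above.
# The "LLM" subset name and "messages" column are assumptions based on
# LLMCompressor's published calibration examples; adjust to the real schema.
from datasets import load_dataset
from transformers import AutoProcessor

DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

ds = load_dataset(DATASET_ID, "LLM",
                  split=f"train[:{NUM_CALIBRATION_SAMPLES}]")

def tokenize(sample):
    # Render the chat-format sample to text, then tokenize with truncation
    # so no sequence exceeds MAX_SEQUENCE_LENGTH.
    text = processor.apply_chat_template(sample["messages"], tokenize=False)
    return processor(
        text=text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)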

Tested on:

  • CUDA: 13.0
  • vLLM: 0.11.0
  • torch: 2.8.0
  • flashinfer-cubin: 0.5.1
  • flashinfer-jit-cache: 0.5.1+cu130
  • flashinfer-python: 0.5.1
  • transformers: 4.57.1

vLLM serving command

VLLM_SLEEP_WHEN_IDLE=1 vllm serve lhoang8500/Qwen3-VL-8B-Instruct-NVFP4 --max-model-len 32768 -tp 1 --limit_mm_per_prompt '{"image":1, "video":0}' --kv-cache-dtype fp8 --gpu-memory-utilization 0.9 --max-num-seqs 64 --tool-call-parser hermes --enable-auto-tool-choice
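
With the server running, a quick smoke test through vLLM's OpenAI-compatible API might look like the following; the port and image URL are placeholders:

# Hedged sketch: one image + text request against the running server.
# The base URL and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lhoang8500/Qwen3-VL-8B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)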

Generation hyperparameters

greedy=false
seed=3407
top_p=0.8
top_k=20
temperature=0.7
repetition_penalty=1.0
presence_penalty=1.5
out_seq_length=32768
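
These settings map onto vLLM's OpenAI-compatible API as sketched below; top_k and repetition_penalty are not part of the standard OpenAI schema, so they are passed through extra_body (a vLLM extension):

# Hedged sketch: apply the sampling parameters above to a chat request.
# top_k and repetition_penalty go through extra_body, a vLLM extension.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lhoang8500/Qwen3-VL-8B-Instruct-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    temperature=0.7,        # greedy=false, so sampling is enabled
    top_p=0.8,
    presence_penalty=1.5,
    seed=3407,
    max_tokens=1024,        # out_seq_length allows up to 32768
    extra_body={"top_k": 20, "repetition_penalty": 1.0},
)
print(response.choices[0].message.content)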

Quantization code

See quantize.py in this repository; credit to LLMCompressor.
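
quantize.py is the authoritative recipe. For orientation only, here is a rough, hedged sketch of what an NVFP4 one-shot run with LLMCompressor typically looks like; the ignore patterns, the model class, and the dataset handling are assumptions, not a copy of quantize.py:

# Hedged sketch of an NVFP4 one-shot run with LLMCompressor.
# NOT the repository's quantize.py: the ignore patterns, model class,
# and dataset handling below are assumptions for a Qwen3-VL checkpoint.
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
SAVE_DIR = "Qwen3-VL-8B-Instruct-NVFP4"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 quantizes Linear weights and activations to 4-bit FP (W4A4).
# lm_head and the vision tower are left unquantized (assumed pattern).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:visual.*"],
)

# Calibration set matching the constants above; the "LLM" subset name
# is taken from LLMCompressor's published examples and may differ here.
ds = load_dataset("neuralmagic/calibration", "LLM", split="train[:256]")

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=8192,
    num_calibration_samples=256,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)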
