# Qwen3-VL-8B-Instruct-NVFP4
## Intro

- Original model: Qwen/Qwen3-VL-8B-Instruct
- Quantization: NVFP4 using LLMCompressor
- Precision: W4A4 (4-bit weights and 4-bit activations)
- GPU support: Blackwell GPUs only (e.g., the RTX 50 series)

This model is fully compatible with vLLM and fits on a single GPU with 16 GB of VRAM. It runs roughly 1.5× faster than the INT4 version, with potentially better accuracy.
## Calibration Data

```python
DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192
```
## Tested on

- cuda: 13.0
- vllm: 0.11.0
- torch: 2.8.0
- flashinfer-cubin: 0.5.1
- flashinfer-jit-cache: 0.5.1+cu130
- flashinfer-python: 0.5.1
- transformers: 4.57.1
## vLLM serving command

```bash
VLLM_SLEEP_WHEN_IDLE=1 vllm serve lhoang8500/Qwen3-VL-8B-Instruct-NVFP4 \
  --max-model-len 32768 \
  -tp 1 \
  --limit_mm_per_prompt '{"image":1, "video":0}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 64 \
  --tool-call-parser hermes \
  --enable-auto-tool-choice
```
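Once the server is up, it exposes an OpenAI-compatible API. Below is a minimal sketch of a single-image request with the `openai` Python client; the port (8000, vLLM's default), the image URL, and the prompt are placeholder assumptions.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; port 8000 is the default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lhoang8500/Qwen3-VL-8B-Instruct-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder image URL; the server accepts at most one image
            # per prompt, per --limit_mm_per_prompt above.
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)
```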
## Generation hyperparameters

```
greedy = false
seed = 3407
top_p = 0.8
top_k = 20
temperature = 0.7
repetition_penalty = 1.0
presence_penalty = 1.5
out_seq_length = 32768
```
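These settings map directly onto vLLM's `SamplingParams`. A minimal offline-inference sketch follows; the prompt is a placeholder, and `greedy = false` is taken to mean sampling is enabled (nonzero temperature):

```python
from vllm import LLM, SamplingParams

# Mirrors the hyperparameters above; out_seq_length becomes max_tokens.
params = SamplingParams(
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    repetition_penalty=1.0,
    presence_penalty=1.5,
    seed=3407,
    max_tokens=32768,
)

llm = LLM(model="lhoang8500/Qwen3-VL-8B-Instruct-NVFP4", max_model_len=32768)
outputs = llm.generate(["Describe NVFP4 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```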
## Quantization code

See the file `quantize.py`; credit to LLMCompressor.
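For orientation, here is a hedged sketch of what a one-shot NVFP4 quantization with LLMCompressor typically looks like, reusing the calibration constants above. This is an illustration, not the contents of `quantize.py`: the model-loading class, the `ignore` list (lm_head plus the vision tower), and the dataset handling are assumptions based on LLMCompressor's published multimodal examples.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"
DATASET_ID = "neuralmagic/calibration"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 quantizes linear weights and activations to FP4 (W4A4). The lm_head
# and the vision tower are assumed to stay in higher precision, as in
# LLMCompressor's multimodal examples.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:visual.*"],
)

# One-shot calibration over the dataset above; the real script may apply
# extra dataset preprocessing.
oneshot(
    model=model,
    dataset=DATASET_ID,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Qwen3-VL-8B-Instruct-NVFP4", save_compressed=True)
processor.save_pretrained("Qwen3-VL-8B-Instruct-NVFP4")
```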