---
license: apache-2.0
base_model:
- Qwen/Qwen3-Coder-480B-A35B-Instruct
---
# Qwen3-Coder-480B-A35B-Instruct, NVFP4 Quantized

**Qwen3‑Coder‑480B‑A35B‑Instruct Model Comparison: Full Precision vs NVFP4**
------
## Test Configuration
| Parameter | Setting |
| ----------------------------- | ----------------------------------- |
| **Full‑Precision Model** | DGX-B300 / 4 GPU |
| **NVFP4 Quantized Model** | DGX-B300 / 4 GPU |
| **Inference Engine** | TRT‑LLM (TensorRT‑LLM) |
| **Tested Concurrency Levels** | 1, 2, 4, 8, 16, 32 |
| **Prompt Length** | ≈ 128 tokens (64 different prompts) |
| **Maximum Response Length** | 128 tokens |
## Performance Metrics Comparison
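
The model card does not say how the metrics below were measured. As a rough sketch only, the following shows one common way to derive them from per‑token arrival times. The `RequestTrace` type and all function names are hypothetical and are not part of TensorRT‑LLM or of the benchmark actually used here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded by a hypothetical benchmark client."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each generated token


def ttft_ms(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first token."""
    return (trace.token_times[0] - trace.sent_at) * 1000.0


def itl_ms(trace: RequestTrace) -> float:
    """Inter-token latency: mean gap between consecutive tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) * 1000.0


def total_latency_s(trace: RequestTrace) -> float:
    """Time from sending the request to receiving the last token."""
    return trace.token_times[-1] - trace.sent_at


def tokens_per_second(trace: RequestTrace) -> float:
    """Generated tokens divided by the total request latency."""
    return len(trace.token_times) / total_latency_s(trace)


# Synthetic trace: first token after 100 ms, then one token every 10 ms.
example = RequestTrace(sent_at=0.0,
                       token_times=[0.1 + 0.01 * i for i in range(4)])
print(ttft_ms(example), itl_ms(example),
      total_latency_s(example), tokens_per_second(example))
```

Other tools define these metrics slightly differently (for example, excluding the first token from TPS), so the definitions above are an assumption rather than a description of this benchmark.
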
### 1. Time to First Token (TTFT) – milliseconds
| Concurrency | Full Model | NVFP4 Model | Δ (ms) | Performance Loss |
| ----------- | ---------- | ----------- | ------ | ---------------- |
| 1 | 73.46 | 92.56 | +19.10 | +26.0 % |
| 2 | 136.82 | 173.48 | +36.66 | +26.8 % |
| 4 | 130.01 | 163.84 | +33.83 | +26.0 % |
| 8 | 136.87 | 177.42 | +40.55 | +29.6 % |
| 16 | 163.07 | 174.25 | +11.18 | +6.9 % |
| 32 | 134.69 | 169.11 | +34.42 | +25.6 % |
**TTFT Analysis**
- The NVFP4 model's TTFT is higher at every concurrency level, by about **+23.5 %** on average (unweighted mean of the six per‑level percentages in the table).
- The greatest performance degradation occurs at concurrency 8 (**+29.6 %**).
- The smallest degradation is at concurrency 16 (**+6.9 %**).
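
The Δ and percentage columns can be recomputed directly from the raw TTFT values. The sketch below does that and also takes the unweighted mean of the per‑level percentages. The helper name `relative_change_pct` is illustrative only.

```python
# TTFT values in milliseconds, copied from the table above.
ttft_full = {1: 73.46, 2: 136.82, 4: 130.01, 8: 136.87, 16: 163.07, 32: 134.69}
ttft_nvfp4 = {1: 92.56, 2: 173.48, 4: 163.84, 8: 177.42, 16: 174.25, 32: 169.11}


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of the candidate relative to the baseline."""
    return (candidate - baseline) / baseline * 100.0


# Per-concurrency change of NVFP4 relative to the full-precision model.
changes = {c: relative_change_pct(ttft_full[c], ttft_nvfp4[c]) for c in ttft_full}
mean_change = sum(changes.values()) / len(changes)

for c, pct in changes.items():
    print(f"concurrency {c:>2}: {pct:+.1f} %")
print(f"mean: {mean_change:+.1f} %")
```
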
------
### 2. Inter‑Token Latency (ITL) – milliseconds
| Concurrency | Full Model | NVFP4 Model | Δ (ms) | Latency Change |
| ----------- | ---------- | ----------- | ------ | ---------------- |
| 1 | 8.31 | 8.99 | +0.68 | +8.2 % |
| 2 | 9.92 | 10.01 | +0.09 | +0.9 % |
| 4 | 12.11 | 11.52 | –0.59 | –4.9 % |
| 8 | 14.99 | 13.66 | –1.33 | –8.9 % |
| 16 | 18.42 | 15.68 | –2.74 | –14.9 % |
| 32 | 22.12 | 18.03 | –4.09 | –18.5 % |
**ITL Analysis**
- At low concurrency (1‑2) the NVFP4 model is slightly slower (+8.2 % and +0.9 %).
- From concurrency 4 upward the NVFP4 model **outperforms** the full‑precision model, with up to **18.5 %** lower latency at concurrency 32.
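
To locate the point where the quantized model starts to win, one can scan the tested concurrency levels for the lowest level from which NVFP4 stays ahead. This is a minimal sketch over the ITL table. `first_crossover` is a hypothetical helper, and it only considers the tested levels, not values in between.

```python
from typing import Dict, Optional

# ITL values in milliseconds, copied from the table above.
itl_full = {1: 8.31, 2: 9.92, 4: 12.11, 8: 14.99, 16: 18.42, 32: 22.12}
itl_nvfp4 = {1: 8.99, 2: 10.01, 4: 11.52, 8: 13.66, 16: 15.68, 32: 18.03}


def first_crossover(baseline: Dict[int, float],
                    candidate: Dict[int, float],
                    lower_is_better: bool = True) -> Optional[int]:
    """Lowest tested concurrency from which the candidate wins at every
    higher tested level as well; None if there is no such level."""
    crossover = None
    for c in sorted(baseline):
        if lower_is_better:
            better = candidate[c] < baseline[c]
        else:
            better = candidate[c] > baseline[c]
        if better and crossover is None:
            crossover = c      # start of a winning streak
        elif not better:
            crossover = None   # streak broken, reset
    return crossover


print(first_crossover(itl_full, itl_nvfp4))
```

For a higher‑is‑better metric such as TPS, the same helper can be called with `lower_is_better=False`.
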
------
### 3. Tokens Per Second (TPS) – tokens / s
| Concurrency | Full Model | NVFP4 Model | Δ (tokens/s) | Performance Change |
| ----------- | ---------- | ----------- | ------------ | ------------------ |
| 1 | 112.61 | 103.54 | –9.07 | –8.1 % |
| 2 | 91.60 | 88.53 | –3.07 | –3.3 % |
| 4 | 76.61 | 78.11 | +1.50 | +2.0 % |
| 8 | 62.58 | 66.77 | +4.19 | +6.7 % |
| 16 | 51.03 | 58.03 | +7.00 | +13.7 % |
| 32 | 43.37 | 51.75 | +8.38 | +19.3 % |
**TPS Analysis**
- The full‑precision model is faster at low concurrency (1‑2), by 8.1 % and 3.3 %.
- From concurrency 4 upward, the NVFP4 model yields higher throughput, reaching **+19.3 %** at concurrency 32.
------
### 4. Total Latency – seconds
| Concurrency | Full Model | NVFP4 Model | Δ (s) | Performance Change |
| ----------- | ---------- | ----------- | ----- | ------------------ |
| 1 | 1.12 | 1.23 | +0.11 | +9.8 % |
| 2 | 1.40 | 1.45 | +0.05 | +3.6 % |
| 4 | 1.66 | 1.61 | –0.05 | –3.0 % |
| 8 | 2.03 | 1.90 | –0.13 | –6.4 % |
| 16 | 2.49 | 2.15 | –0.34 | –13.7 % |
| 32 | 2.94 | 2.43 | –0.51 | –17.3 % |
**Latency Analysis**
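- The full‑precision model has lower total latency at low concurrency (1‑2), by 0.11 s and 0.05 s.
- From concurrency 4 upward the NVFP4 model is faster, with up to **17.3 %** lower total latency at concurrency 32.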
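
As a consistency check, total latency can be approximated as TTFT plus one inter‑token gap for each remaining token. The sketch below assumes every response used the full 128‑token limit, which the source does not state explicitly, and compares the estimate with the reported values.

```python
OUTPUT_TOKENS = 128  # maximum response length from the test configuration

# concurrency: (TTFT ms, ITL ms, reported total latency s), from the tables above
full = {
    1: (73.46, 8.31, 1.12), 2: (136.82, 9.92, 1.40), 4: (130.01, 12.11, 1.66),
    8: (136.87, 14.99, 2.03), 16: (163.07, 18.42, 2.49), 32: (134.69, 22.12, 2.94),
}
nvfp4 = {
    1: (92.56, 8.99, 1.23), 2: (173.48, 10.01, 1.45), 4: (163.84, 11.52, 1.61),
    8: (177.42, 13.66, 1.90), 16: (174.25, 15.68, 2.15), 32: (169.11, 18.03, 2.43),
}


def estimated_latency_s(ttft_ms: float, itl_ms: float,
                        n_tokens: int = OUTPUT_TOKENS) -> float:
    """First token after TTFT, then (n_tokens - 1) inter-token gaps."""
    return (ttft_ms + (n_tokens - 1) * itl_ms) / 1000.0


# Largest absolute gap between the estimate and the reported latency.
max_gap = 0.0
for name, rows in (("full", full), ("nvfp4", nvfp4)):
    for c, (ttft, itl, reported) in rows.items():
        est = estimated_latency_s(ttft, itl)
        max_gap = max(max_gap, abs(est - reported))
        print(f"{name:>5} c={c:>2}: estimated {est:.2f} s, reported {reported:.2f} s")
```

The estimates stay within a few hundredths of a second of the reported values, so the growth in total latency with concurrency is explained almost entirely by the ITL trend rather than by TTFT.
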
------
### 5. Throughput (RPS) – requests / s
| Concurrency | Full Model | NVFP4 Model | Δ (RPS) | Performance Change |
| ----------- | ---------- | ----------- | ------- | ------------------ |
| 1 | 0.90 | 0.81 | –0.09 | –10.0 % |
| 2 | 0.72 | 0.69 | –0.03 | –4.2 % |
| 4 | 0.60 | 0.62 | +0.02 | +3.3 % |
| 8 | 0.49 | 0.53 | +0.04 | +8.2 % |
| 16 | 0.40 | 0.46 | +0.06 | +15.0 % |
| 32 | 0.34 | 0.41 | +0.07 | +20.6 % |
**Throughput Analysis**
- The full‑precision model wins at low concurrency (1‑2), by 10.0 % and 4.2 %.
- NVFP4 model surpasses it from concurrency 4 onward, achieving **+20.6 %** at concurrency 32.
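
The reported RPS values are close to the reciprocal of the total latency in section 4 (for example, 1 / 1.12 s ≈ 0.89). That suggests they describe a single request stream, although this is an inference and not something the source states. Under that assumption, and assuming a closed loop that keeps all concurrent slots busy, Little's law gives a rough estimate of system‑wide throughput:

```python
# Total latency in seconds, copied from the table in section 4.
latency_full = {1: 1.12, 2: 1.40, 4: 1.66, 8: 2.03, 16: 2.49, 32: 2.94}
latency_nvfp4 = {1: 1.23, 2: 1.45, 4: 1.61, 8: 1.90, 16: 2.15, 32: 2.43}


def aggregate_rps(concurrency: int, latency_s: float) -> float:
    """Little's law for a closed loop: throughput = requests in flight / latency."""
    return concurrency / latency_s


# Estimated system-wide requests per second at each tested concurrency.
for c in latency_full:
    full_rps = aggregate_rps(c, latency_full[c])
    nvfp4_rps = aggregate_rps(c, latency_nvfp4[c])
    print(f"c={c:>2}: full {full_rps:.2f} req/s, NVFP4 {nvfp4_rps:.2f} req/s")
```

These aggregate figures are estimates derived from the latency table, not measured results.
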