IQuest-Coder-V1-40B-Loop-Instruct NVFP4 (Blackwell Optimized)
📄 Executive Summary
This repository hosts the officially optimized NVFP4 distribution of the IQuest-Loop-Coder. By leveraging NVIDIA's 4-bit floating point format (NVFP4), we have compressed the massive 80GB BF16 baseline into a ~21.9 GB high-performance engine, specifically tailored for NVIDIA Blackwell (SM 100/121) hardware.
Note on Model Size: While Hugging Face may report ~20B parameters, this is a Recursive "Loop" Architecture. With loop_num=2, the model effectively executes 40B parameters' worth of compute per generated token (80 layers × 2 passes), achieving reasoning depth far exceeding standard 20B models.
🏗 The Engineering Challenge: Quantizing Recursion
Standard quantization tools (like AutoGPTQ or unmodified ModelOpt) fail on the IQuestLoopCoder architecture because of its unique Dual-ModuleList Recursive Structure. The logic loops back on itself, causing standard tracing algorithms to either crash or miss 50% of the computation graph.
📜 The Solution: A Custom Engineering Pipeline
To achieve a coherent 4-bit export, we implemented a specialized three-stage pipeline:
1. Instrumentation Patching (layer_utils.py)
We manually patched the NVIDIA modelopt library to support "Unrolled Tracing". By injecting custom hooks into the IQuestLoopCoderModel definition, we forced the quantizer to "see" the unrolled execution path.
- Problem: Standard tracers see one pass and stop.
- Fix: Our patch force-traces all loop_num iterations, ensuring the quantization scales are calibrated for the accumulated activation variance of both loops.
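The sketch below illustrates the general idea in plain PyTorch rather than the actual modelopt patch: a forward hook on every Linear module records the peak activation magnitude across all recursive passes, so the derived 4-bit scales cover both loop iterations. The function name, hook logic, and the `calib_batches` iterable are illustrative assumptions.

```python
# Minimal sketch of the unrolled-tracing idea (NOT the actual layer_utils.py patch).
# `model` and `calib_batches` are assumed to be an IQuestLoopCoder instance and an
# iterable of tokenized batches.
import torch

def calibrate_unrolled(model, calib_batches):
    """Collect per-layer activation maxima across every recursive loop pass."""
    amax = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # The recursive model re-enters its layer stack loop_num times per
            # forward call, so this hook fires once per pass; keeping the max
            # calibrates the scales for the accumulated variance of both loops.
            amax[name] = max(amax.get(name, 0.0), output.detach().abs().max().item())
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)]

    model.eval()
    with torch.no_grad():
        for batch in calib_batches:
            model(**batch)

    for h in handles:
        h.remove()
    return amax
```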
2. Deep-Logic Calibration (AWQ-Aware)
Blindly quantizing weights destroys reasoning capabilities. We used Activation-aware Weight Quantization (AWQ) with a high-precision calibration dataset sourced from MBPP (Mostly Basic Python Problems).
- Process: We fed 64 complex Python coding problems through the unrolled model.
- Result: The algorithm identified the "Salient Weights" (the 1% of parameters most critical for logic) and protected them with higher-precision scaling factors, while compressing the robust weights to 4-bit.
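As a rough illustration of how the 64-problem MBPP calibration set can be assembled, the sketch below loads the dataset and tokenizes each problem with the instruct chat template. The dataset field name ("text") and the template usage are assumptions, not the exact pipeline we ran.

```python
# Hedged sketch: building 64 MBPP calibration batches for the AWQ pass.
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

mbpp = load_dataset("mbpp", split="train").select(range(64))

def to_batch(example):
    # Wrap each problem statement in the instruct chat template so calibration
    # activations resemble real inference traffic.
    messages = [{"role": "user", "content": example["text"]}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    return tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)

calib_batches = [to_batch(ex) for ex in mbpp]
# `calib_batches` is then fed through the unrolled model (e.g. the
# calibrate_unrolled() sketch above) to collect activation statistics.
```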
3. Unified HF Export (Engine Agnostic)
The final artifact is not a proprietary blob but a standard Unified Hugging Face Checkpoint.
- Container: safetensors
- Weights: uint8 (packed 4-bit payloads)
- Scales: float8_e4m3fn (high-dynamic-range scaling factors)
- Compatibility: Natively loadable by vLLM (Blackwell kernels), TensorRT-LLM, and SGLang.
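To sanity-check this layout, you can open one shard with the safetensors library and print the tensor dtypes; the shard filename below is a placeholder.

```python
# Quick dtype inspection of the exported checkpoint (shard name is a placeholder).
from safetensors import safe_open

with safe_open("model-00001-of-00005.safetensors", framework="pt") as f:
    for name in list(f.keys())[:10]:
        t = f.get_tensor(name)
        # Expect uint8 for packed 4-bit weights, float8_e4m3fn for scales,
        # and bfloat16 for the remaining metadata/config tensors.
        print(f"{name:60s} {str(t.dtype):18s} {tuple(t.shape)}")
```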
📊 Performance & Verification
This model has been rigorously stress-tested on Blackwell hardware to ensure the "Loop Tax" (the latency cost of recursion) is outweighed by the precision gains.
| Metric | Result | Context |
|---|---|---|
| Inference Speed | ~3.86 tokens/s | Verified on Single Blackwell GPU (Eager Mode) |
| VRAM Usage | 22.15 GiB | Massive reduction from ~80GB (BF16) |
| Coherence | 100% Pass | Tested on Nested Python Loops & Mathematical Induction |
| Stop Behavior | Perfect | Stability Patch applied to tokenizer_config.json |
🐳 Quick Start: Deployment
Option 1: vLLM (Recommended)
The official vLLM image (vllm-blackwell-official) contains the specific kernels required to decode this NVFP4 format.
vllm serve Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 \
--quantization modelopt \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.85
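Once the server is up, vLLM exposes its standard OpenAI-compatible API on port 8000. A minimal client example (the prompt is arbitrary):

```python
# Query the served model through vLLM's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4",
    messages=[{"role": "user",
               "content": "Write a Python function that checks whether a string is a palindrome."}],
    max_tokens=512,
    temperature=0.2,
)
print(response.choices[0].message.content)
```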
Option 2: Docker Compose (Production Ready)
Use this docker-compose.yaml for a turnkey deployment on port 8000:
services:
vllm-iquest-deploy:
image: vllm-blackwell-official:latest
container_name: iquest-coder-server
environment:
- VLLM_USE_V1=1
- NVIDIA_VISIBLE_DEVICES=all
volumes:
- .:/model
ports:
- "8000:8000"
command: >
--model Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4
--served-model-name iquest-coder-40b-loop
--quantization modelopt
--trust-remote-code
--enforce-eager
--tensor-parallel-size 1
--gpu-memory-utilization 0.85
--max-model-len 32768
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
To run:
docker compose up -d
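Because the ~22 GiB engine takes a while to load, it can be handy to poll the OpenAI-compatible /v1/models endpoint until the server reports ready. A small readiness-check sketch:

```python
# Poll the API until the engine has finished loading, then list the served model.
import time
import requests

URL = "http://localhost:8000/v1/models"

for _ in range(60):                      # wait up to ~10 minutes
    try:
        resp = requests.get(URL, timeout=5)
        if resp.ok:
            print([m["id"] for m in resp.json()["data"]])  # e.g. ['iquest-coder-40b-loop']
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)
```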
🧩 Technical Specifications
- Base Model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
- Quantization Algorithm: NVIDIA ModelOpt (AWQ + FP8 Scales)
- Weight Layout: Packed 4-bit (uint8 container)
- Scale Format: e4m3fn (FP8)
- Group Size: 128 (standard)
Tensor Type Explanation
If viewing file info, you may see mixed types:
- BF16: Model metadata / configuration tensors.
- U8: The actual compressed 4-bit weights (two packed per byte).
- F8_E4M3: The scaling factors used to decompress the U8 weights at runtime.
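For intuition, the sketch below dequantizes the layout described above: two 4-bit codes per U8 byte and one F8_E4M3 scale per group of 128 weights. The E2M1 value table and the low-nibble-first ordering are assumptions for illustration, not the exact kernel behaviour in vLLM or TensorRT-LLM.

```python
# Illustrative NVFP4 dequantization of the described layout (assumptions noted above).
import torch

# Sign-magnitude E2M1 code values: codes 0-7 positive, 8-15 their negatives.
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_nvfp4(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128):
    """packed: uint8 tensor (two codes per byte); scales: float8 tensor (one per group)."""
    low = packed & 0x0F                        # first 4-bit code in each byte (assumed order)
    high = packed >> 4                         # second 4-bit code
    codes = torch.stack([low, high], dim=-1).flatten().long()
    values = E2M1_VALUES[codes]                # decode 4-bit codes to real values
    values = values.view(-1, group_size)       # one scale applies to each group
    return (values * scales.float().view(-1, 1)).flatten()
```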
📄 License & Credits
- Original Architecture: IQuest Lab
- NVFP4 Optimization: Elias Schwegler
- Tools Used: NVIDIA ModelOpt, vLLM Project
- License: Apache 2.0