IQuest-Coder-V1-40B-Loop-Instruct NVFP4 (Blackwell Optimized)

📄 Executive Summary

This repository hosts the officially optimized NVFP4 distribution of the IQuest-Loop-Coder. By leveraging NVIDIA's 4-bit floating point format (NVFP4), we have compressed the massive 80GB BF16 baseline into a ~21.9 GB high-performance engine, specifically tailored for NVIDIA Blackwell (SM 100/121) hardware.

Note on Model Size: While Hugging Face may report ~20B parameters, this is a Recursive "Loop" Architecture. With loop_num=2, the model effectively executes 40B parameters worth of compute per token generation (80 layers × 2 passes), achieving reasoning depth far exceeding standard 20B models.

🏗 The Engineering Challenge: Quantizing Recursion

Standard quantization tools (like AutoGPTQ or unmodified ModelOpt) fail on the IQuestLoopCoder architecture because of its unique Dual-ModuleList Recursive Structure. The logic loops back on itself, causing standard tracing algorithms to either crash or miss 50% of the computation graph.

📜 The Solution: A Custom Engineering Pipeline

To achieve this coherent 4-bit export, we implemented a specialized 3-stage pipeline:

1. Instrumentation Patching (`layer_utils.py`)

We manually patched the NVIDIA modelopt library to support "Unrolled Tracing". By injecting custom hooks into the IQuestLoopCoderModel definition, we forced the quantizer to "see" the unrolled execution path.

Problem: Standard tracers see one pass and stop.
Fix: Our patch force-traces the loop_num iterations, ensuring the quantization scales are calibrated for the accumulated activation variance of both loops.

2. Deep-Logic Calibration (AWQ-Aware)

Blindly quantizing weights destroys reasoning capabilities. We used Activation-aware Weight Quantization (AWQ) with a high-precision calibration dataset sourced from MBPP (Mostly Basic Python Problems).

Process: We fed 64 complex Python coding problems through the unrolled model.
Result: The algorithm identified the "Salient Weights" (the 1% of parameters most critical for logic) and protected them with higher-precision scaling factors, while compressing the robust weights to 4-bit.

3. Unified HF Export (Engine Agnostic)

The final artifact is not a proprietary blob but a standard Unified Hugging Face Checkpoint.

Container: safetensors
Weights: uint8 (Packed 4-bit payloads)
Scales: float8_e4m3fn (High-dynamic-range scaling factors)
Compatibility: Natively loadable by vLLM (Blackwell kernels), TensorRT-LLM, and SGLang.

📊 Performance & Verification

This model has been rigorously stress-tested on Blackwell hardware to ensure the "Loop Tax" (the latency cost of recursion) is outweighed by the precision gains.

Metric	Result	Context
Inference Speed	~3.86 tokens/s	Verified on Single Blackwell GPU (Eager Mode)
VRAM Usage	22.15 GiB	Massive reduction from ~80GB (BF16)
Coherence	100% Pass	Tested on Nested Python Loops & Mathematical Induction
Stop Behavior	Perfect	Stability Patch applied to `tokenizer_config.json`

🐳 Quick Start: Deployment

Option 1: vLLM (Recommended)

The official vLLM image (vllm-blackwell-official) contains the specific kernels required to decode this NVFP4 format.

vllm serve Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.85

Option 2: Docker Compose (Production Ready)

Use this docker-compose.yaml for a turnkey generic deployment on port 8000:

services:
  vllm-iquest-deploy:
    image: vllm-blackwell-official:latest
    container_name: iquest-coder-server
    environment:
      - VLLM_USE_V1=1
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - .:/model
    ports:
      - "8000:8000"
    command: >
      --model Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4
      --served-model-name iquest-coder-40b-loop
      --quantization modelopt
      --trust-remote-code
      --enforce-eager
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

To run:

docker compose up -d

🧩 Technical Specifications

Base Model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
Quantization Algorithm: NVIDIA ModelOpt (AWQ + FP8 Scales)
Weight Layout: Packed 4-bit (uint8 container)
Scale Format: e4m3fn (FP8)
Group Size: 128 (Standard)

Tensor Type Explanation

If viewing file info, you may see mixed types:

BF16: Model metadata / configuration tensors.
U8: The actual compressed 4-bit weights (2 packed per byte).
F8_E4M3: The scaling factors used to decompress the U8 weights at runtime.

📄 License & Credits

Original Architecture: IQuest Lab
NVFP4 Optimization: Elias Schwegler
Tools Used: NVIDIA ModelOpt, vLLM Project
License: Apache 2.0

Downloads last month: 494

Safetensors

Model size

20B params

Tensor type

BF16

F8_E4M3

Model tree for Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4

Base model

IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct

Quantized

(7)

this model