IQuest-Coder-V1-40B-Loop-Instruct NVFP4 (Blackwell Optimized)

Loop Transformer NVFP4 ModelOpt Blackwell

馃搫 Executive Summary

This repository hosts the officially optimized NVFP4 distribution of the IQuest-Loop-Coder. By leveraging NVIDIA's 4-bit floating point format (NVFP4), we have compressed the massive 80GB BF16 baseline into a ~21.9 GB high-performance engine, specifically tailored for NVIDIA Blackwell (SM 100/121) hardware.

Note on Model Size: While Hugging Face may report ~20B parameters, this is a Recursive "Loop" Architecture. With loop_num=2, the model effectively executes 40B parameters worth of compute per token generation (80 layers 脳 2 passes), achieving reasoning depth far exceeding standard 20B models.


馃彈 The Engineering Challenge: Quantizing Recursion

Standard quantization tools (like AutoGPTQ or unmodified ModelOpt) fail on the IQuestLoopCoder architecture because of its unique Dual-ModuleList Recursive Structure. The logic loops back on itself, causing standard tracing algorithms to either crash or miss 50% of the computation graph.

馃摐 The Solution: A Custom Engineering Pipeline

To achieve this coherent 4-bit export, we implemented a specialized 3-stage pipeline:

1. Instrumentation Patching (layer_utils.py)

We manually patched the NVIDIA modelopt library to support "Unrolled Tracing". By injecting custom hooks into the IQuestLoopCoderModel definition, we forced the quantizer to "see" the unrolled execution path.

  • Problem: Standard tracers see one pass and stop.
  • Fix: Our patch force-traces the loop_num iterations, ensuring the quantization scales are calibrated for the accumulated activation variance of both loops.

2. Deep-Logic Calibration (AWQ-Aware)

Blindly quantizing weights destroys reasoning capabilities. We used Activation-aware Weight Quantization (AWQ) with a high-precision calibration dataset sourced from MBPP (Mostly Basic Python Problems).

  • Process: We fed 64 complex Python coding problems through the unrolled model.
  • Result: The algorithm identified the "Salient Weights" (the 1% of parameters most critical for logic) and protected them with higher-precision scaling factors, while compressing the robust weights to 4-bit.

3. Unified HF Export (Engine Agnostic)

The final artifact is not a proprietary blob but a standard Unified Hugging Face Checkpoint.

  • Container: safetensors
  • Weights: uint8 (Packed 4-bit payloads)
  • Scales: float8_e4m3fn (High-dynamic-range scaling factors)
  • Compatibility: Natively loadable by vLLM (Blackwell kernels), TensorRT-LLM, and SGLang.

馃搳 Performance & Verification

This model has been rigorously stress-tested on Blackwell hardware to ensure the "Loop Tax" (the latency cost of recursion) is outweighed by the precision gains.

Metric Result Context
Inference Speed ~3.86 tokens/s Verified on Single Blackwell GPU (Eager Mode)
VRAM Usage 22.15 GiB Massive reduction from ~80GB (BF16)
Coherence 100% Pass Tested on Nested Python Loops & Mathematical Induction
Stop Behavior Perfect Stability Patch applied to tokenizer_config.json

馃惓 Quick Start: Deployment

Option 1: vLLM (Recommended)

The official vLLM image (vllm-blackwell-official) contains the specific kernels required to decode this NVFP4 format.

vllm serve Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --enforce-eager \
    --gpu-memory-utilization 0.85

Option 2: Docker Compose (Production Ready)

Use this docker-compose.yaml for a turnkey generic deployment on port 8000:

services:
  vllm-iquest-deploy:
    image: vllm-blackwell-official:latest
    container_name: iquest-coder-server
    environment:
      - VLLM_USE_V1=1
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - .:/model
    ports:
      - "8000:8000"
    command: >
      --model Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4
      --served-model-name iquest-coder-40b-loop
      --quantization modelopt
      --trust-remote-code
      --enforce-eager
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.85
      --max-model-len 32768
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

To run:

docker compose up -d

馃З Technical Specifications

  • Base Model: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
  • Quantization Algorithm: NVIDIA ModelOpt (AWQ + FP8 Scales)
  • Weight Layout: Packed 4-bit (uint8 container)
  • Scale Format: e4m3fn (FP8)
  • Group Size: 128 (Standard)

Tensor Type Explanation

If viewing file info, you may see mixed types:

  • BF16: Model metadata / configuration tensors.
  • U8: The actual compressed 4-bit weights (2 packed per byte).
  • F8_E4M3: The scaling factors used to decompress the U8 weights at runtime.

馃搫 License & Credits

  • Original Architecture: IQuest Lab
  • NVFP4 Optimization: Elias Schwegler
  • Tools Used: NVIDIA ModelOpt, vLLM Project
  • License: Apache 2.0
Downloads last month
494
Safetensors
Model size
20B params
Tensor type
BF16
F8_E4M3
U8
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for Elias-Schwegler/IQuest-Coder-V1-40B-Loop-Instruct-NVFP4

Quantized
(7)
this model