Improve model card: add pipeline tag, library, paper, code, and usage
This PR significantly enhances the model card for the VGT (Visual Generation Tuning) model. Key improvements include:
* Adding the `pipeline_tag: text-to-image` to improve discoverability on the Hugging Face Hub.
* Specifying `library_name: diffusers` based on the project's installation requirements, enabling the "How to use" widget.
* Including a direct link to the paper: [Visual Generation Tuning](https://huggingface.co/papers/2511.23469).
* Providing a link to the official GitHub repository: [hustvl/VGT](https://github.com/hustvl/VGT).
* Adding a "Getting Started" section with installation instructions, pretrained model details, and an inference command, all sourced directly from the GitHub README to help users quickly get started.
* Incorporating key highlights and an overview of VGT's methodology from the GitHub README.
These updates make the model card more informative, discoverable, and user-friendly.
The previous card contained only the `license: mit` metadata; the updated model card reads as follows:
---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---

<div align="center">
<h2>VGT: Visual Generation Tuning</h2>

<div align="center">
<img src="https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft/resolve/main/asserts/vgt_logo.png" alt="VGT">
</div>

**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

**GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence**

[License](https://github.com/hustvl/VGT/blob/main/LICENSE)
[GitHub](https://github.com/hustvl)
[Model: VGT-Qwen2.5-VL-2B-SFT](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft)
[Model: VGT-InternVL3-1.6B-SFT](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft)
[Paper](https://huggingface.co/papers/2511.23469)
[Code](https://github.com/hustvl/VGT)

</div>

<div align="center">
<img src="https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft/resolve/main/asserts/case_show.png" alt="VGT Generated Images">
</div>

---

## Highlights

- **Novel Paradigm**: Transform any pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning
- **20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models
- **SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data
- **Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations
- **Parallel Inference**: The QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation
- **Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs

---

## What is VGT?

**VGT (Visual Generation Tuning)** is a paradigm built around a fundamental question:

> *Can we directly leverage the well-aligned semantic representations in pretrained VLMs to enable visual generation capabilities?*

VGT is designed to unlock the latent visual generation capabilities of any Vision-Language Model (VLM). By performing efficient visual generation tuning on well-pretrained VLMs, VGT significantly reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous space (20× speedup). Specifically, VGT formulates VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. On image reconstruction, it achieves 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs; on visual generation, it achieves state-of-the-art results among autoregressive models.

More details about the core problem and our solution can be found in the [GitHub repository](https://github.com/hustvl/VGT) and the [paper](https://huggingface.co/papers/2511.23469).

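To make the VGT-AE formulation above concrete, here is a minimal, illustrative sketch of the idea: a frozen semantic encoder taken from a pretrained VLM produces patch features, a small projector maps them into a compact continuous latent space, and a lightweight pixel decoder learns to reconstruct the image from those latents. All class names, dimensions, and the decoder architecture below are assumptions for illustration only; refer to the GitHub repository for the actual implementation.

```python
# Illustrative sketch of the VGT-AE idea only -- names, dimensions, and the
# decoder architecture are assumptions, not the actual VGT implementation.
import torch
import torch.nn as nn


class ToyVGTAE(nn.Module):
    def __init__(self, vlm_vision_encoder: nn.Module, feat_dim: int = 1024,
                 latent_dim: int = 64, patch_size: int = 16):
        super().__init__()
        # Semantic encoder taken from a pretrained VLM, kept frozen.
        self.encoder = vlm_vision_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Project VLM features into a compact continuous latent space.
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Lightweight pixel decoder: one RGB patch per latent token (placeholder).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, 3 * patch_size * patch_size),
        )

    def forward(self, images: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        feats = self.encoder(images)       # (B, N, feat_dim) patch-level semantic features
        latents = self.to_latent(feats)    # (B, N, latent_dim) latents for continuous-space AR modeling
        patches = self.decoder(latents)    # (B, N, 3 * patch_size * patch_size) reconstructed patches
        return latents, patches
```

Training such an autoencoder with a reconstruction objective is what lets the autoregressive model later operate directly on VLM-aligned continuous latents instead of learning a visual space from scratch.
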
---

## Results

### Text-to-Image Generation Performance

<div align="center">
<img src="https://github.com/hustvl/VGT/raw/main/asserts/generation_results.png" alt="generation_results">
</div>

**Highlights**:
- **SOTA among autoregressive models** with GenEval **0.83** and DPG-Bench **81.28**
- **Extreme data efficiency**: Trained on only **<25M samples** (vs. 198M-2048M+ for competitors)
- **Competitive with diffusion models** despite using far less data and compute

### Image Reconstruction Quality

Evaluated on the ImageNet 50K validation set (256×256):

| Method | Type | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
|:-------|:-----|:-----:|:------:|:------:|:------:|
| **Generative-Only Tokenizers** |
| VQGAN | VQ | 16× | 4.98 | 20.00 | 0.629 |
| LlamaGen | VQ | 16× | 2.19 | 20.79 | 0.675 |
| SD-VAE | VAE | 16× | 2.64 | 22.13 | 0.590 |
| VAR | VAE | 16× | 1.00 | 22.63 | 0.755 |
| Open-MAGVIT2 | VQ | 16× | 1.67 | 22.70 | 0.640 |
| RAE | VAE | 16× | 0.49 | 19.23 | 0.620 |
| DC-AE | VAE | 32× | 0.69 | 23.85 | 0.660 |
| **CLIP-based Tokenizers** |
| VILA-U | CLIP | 16× | 1.80 | - | - |
| TokenFlow | CLIP | 16× | 1.37 | 21.41 | 0.687 |
| DualViTok | CLIP | 16× | 1.37 | 22.53 | 0.741 |
| UniLIP | CLIP | 32× | 0.79 | 22.99 | 0.747 |
| **VGT (Ours)** |
| VGT-AE (Qwen2.5VL) | VLM | 28× | 1.93 | 20.12 | 0.677 |
| VGT-AE (InternVL3) | VLM | **28×** | **0.50** | **26.67** | **0.863** |

**Highlights**:
- **Best reconstruction quality**: 26.67 PSNR and 0.50 rFID at 28× compression
- **Outperforms specialized VAEs**: Superior to SD-VAE, VAR, and other tokenizers
- **Higher compression ratio**: Achieves 28× compression vs. 16× for most methods

---

## Getting Started

### Installation

```bash
# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Install dependencies
conda create -n vgt python=3.10
conda activate vgt
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
```

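If you want to verify the environment before running anything, a quick check such as the one below (not part of the official instructions; package names follow the install commands above) confirms that the key dependencies import correctly:

```python
# Optional sanity check -- not part of the official VGT setup instructions.
import torch
import transformers
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # the install above pins 4.57.1
print("diffusers:", diffusers.__version__)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; install it with `pip install flash-attn --no-build-isolation`")
```
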
### Pretrained Models

We provide VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px):

| Model | Base Model | GenEval | DPG-Bench | Download |
|:------|:-----------|:-------:|:---------:|:--------:|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |

### Inference

Download the SFT model checkpoints:

```bash
cd VGT
mkdir ckpts
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft
```

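If you prefer to fetch the checkpoints from Python rather than the `hf` CLI, `huggingface_hub`'s `snapshot_download` achieves the same result (the local directories below simply mirror the layout used above):

```python
# Programmatic alternative to the `hf download` commands above.
from huggingface_hub import snapshot_download

for repo_id in ("hustvl/vgt_qwen25vl_2B_sft", "hustvl/vgt_internvl3_1_6B_sft"):
    path = snapshot_download(repo_id=repo_id, local_dir=f"ckpts/{repo_id}")
    print(f"Downloaded {repo_id} to {path}")
```
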
Generate images from text prompts:

```bash
export PYTHONPATH=./:$PYTHONPATH

# Generate with InternVL3-1.6B
python scripts/sample_text_list_vgt_intervl3_0.6B.py
```

> Note: Under the same training recipe, we found that VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B performs better on landscapes, lighting and shadow, and animals. You are encouraged to explore both.

---

## Acknowledgements

We gratefully acknowledge the following open-source projects: [XTuner](https://github.com/InternLM/xtuner), [MMEngine](https://github.com/open-mmlab/mmengine), [BLIP-3o](https://github.com/JiuhaiChen/BLIP3o), [ShareGPT4o](https://sharegpt4o.github.io/), [Echo-4o](https://github.com/yejy53/Nano-banana-150k), [NextStep-1](https://github.com/stepfun-ai/NextStep-1), [OpenUni](https://github.com/wusize/OpenUni), [UniLIP](https://github.com/nnnth/UniLIP), and [TiTok](https://github.com/bytedance/1d-tokenizer).

---

## Citation

If you find our work useful, please cite our paper:

```bibtex
@misc{guo2025vgt,
      title={Visual Generation Tuning},
      author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
      year={2025},
      eprint={2511.23469},
      archivePrefix={arXiv},
}
```

---

## Contact

- **Author**: Jiahao Guo ([email protected])
- **Project Lead**: Kai Wu ([email protected])
- **Corresponding Author**: Xinggang Wang ([email protected])