nielsr (HF Staff) committed · verified
Commit bd44e03 · 1 Parent(s): e9ea28f

Improve model card: add pipeline tag, library, paper, code, and usage


This PR significantly enhances the model card for the VGT (Visual Generation Tuning) model. Key improvements include:

* Adding the `pipeline_tag: text-to-image` to improve discoverability on the Hugging Face Hub.
* Specifying `library_name: diffusers` based on the project's installation requirements, enabling the "How to use" widget.
* Including a direct link to the paper: [Visual Generation Tuning](https://huggingface.co/papers/2511.23469).
* Providing a link to the official GitHub repository: [hustvl/VGT](https://github.com/hustvl/VGT).
* Adding a "Getting Started" section with installation instructions, pretrained model details, and an inference command, all sourced directly from the GitHub README to help users quickly get started.
* Incorporating key highlights and an overview of VGT's methodology from the GitHub README.

These updates make the model card more informative, discoverable, and user-friendly.

Files changed (1)
1. README.md +180 -3
README.md CHANGED
@@ -1,3 +1,180 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: text-to-image
+ library_name: diffusers
+ ---
+
+ <div align="center">
+ <h2>🚀 VGT: Visual Generation Tuning</h2>
+
+ <div align="center">
+ <img src="https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft/resolve/main/asserts/vgt_logo.png" alt="VGT">
+ </div>
+
+ **_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**
+
+ **GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence**
+
+ [![license](https://img.shields.io/badge/license-MIT-blue)](https://github.com/hustvl/VGT/blob/main/LICENSE)
+ [![authors](https://img.shields.io/badge/by-hustvl-green)](https://github.com/hustvl)
+ [![model](https://img.shields.io/badge/🤗-VGT_Qwen2.5VL_2B-yellow)](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft)
+ [![model](https://img.shields.io/badge/🤗-VGT_InternVL3_1.6B-yellow)](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft)
+ [![paper](https://img.shields.io/badge/arXiv-Paper-red)](https://huggingface.co/papers/2511.23469)
+ [![GitHub Code](https://img.shields.io/badge/GitHub-Code-black?style=flat&logo=github)](https://github.com/hustvl/VGT)
+
+ </div>
+
+ <div align="center">
+ <img src="https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft/resolve/main/asserts/case_show.png" alt="VGT Generated Images">
+ </div>
+
+ ---
+
+ ## ✨ Highlights
+
+ - **🎯 Novel Paradigm**: Transform ANY pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning
+ - **⚡ 20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models
+ - **📊 SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data
+ - **🚀 Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations, 0.60 in 30K iterations
+ - **🔄 Parallel Inference**: QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation
+ - **🎨 Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at 28× compression ratio, outperforming specialized VAEs
+
+ ---
+
+ ## 💡 What is VGT?
+
+ **VGT (Visual Generation Tuning)** is a groundbreaking paradigm that answers a fundamental question:
+
+ > *Can we directly leverage the well-aligned semantic representations in pretrained VLMs to enable visual generation capabilities?*
+
+ VGT unlocks the latent visual generation capabilities of any Vision-Language Model (VLM). By performing efficient visual generation tuning on a well-pretrained VLM, VGT largely avoids the usual alignment cost and accelerates the convergence of autoregressive modeling in continuous space (20× speedup). Concretely, VGT builds VGT-AE by aligning the semantic encoder of the pretrained VLM with the latent representation of a pixel decoder. For image reconstruction, it reaches 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs; for visual generation, it achieves state-of-the-art results among autoregressive models.
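+
+ To make this concrete, here is a minimal, illustrative PyTorch sketch of the VGT-AE idea: a small trainable adapter projects frozen VLM encoder features into a pixel decoder's latent space and is trained with a joint alignment-plus-reconstruction objective. The module names, dimensions, and loss terms below are placeholders for illustration, not the released implementation.
+
+ ```python
+ # Illustrative toy sketch of the VGT-AE idea; NOT the released VGT implementation.
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+
+ class TinyPixelDecoder(nn.Module):
+     """Toy stand-in for a pixel decoder mapping latent tokens back to RGB pixels."""
+
+     def __init__(self, latent_dim=256, patch=16):
+         super().__init__()
+         self.proj = nn.Linear(latent_dim, 3 * patch * patch)
+         self.patch = patch
+
+     def forward(self, z):  # z: (B, N, latent_dim), N must be a perfect square
+         B, N, _ = z.shape
+         side = int(N ** 0.5)
+         x = self.proj(z).view(B, side, side, 3, self.patch, self.patch)
+         return x.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, side * self.patch, side * self.patch)
+
+
+ class ToyVGTAE(nn.Module):
+     """Bridge frozen VLM encoder features into a pixel decoder's latent space."""
+
+     def __init__(self, vlm_dim=1024, latent_dim=256):
+         super().__init__()
+         self.adapter = nn.Sequential(  # the only trainable "tuning" component here
+             nn.Linear(vlm_dim, latent_dim), nn.GELU(), nn.Linear(latent_dim, latent_dim)
+         )
+         self.decoder = TinyPixelDecoder(latent_dim)
+
+     def forward(self, vlm_feats, target_latents, images):
+         z = self.adapter(vlm_feats)  # project semantic features into the latent space
+         align_loss = 1 - F.cosine_similarity(z, target_latents, dim=-1).mean()
+         recon_loss = F.l1_loss(self.decoder(z), images)
+         return align_loss + recon_loss  # joint alignment + reconstruction objective
+
+
+ # Dummy tensors standing in for frozen-VLM patch features, decoder latents, and pixels.
+ B, N, vlm_dim, latent_dim = 2, 256, 1024, 256
+ model = ToyVGTAE(vlm_dim, latent_dim)
+ loss = model(
+     torch.randn(B, N, vlm_dim),     # features from a frozen VLM vision encoder
+     torch.randn(B, N, latent_dim),  # latents a pixel decoder expects
+     torch.randn(B, 3, 256, 256),    # ground-truth pixels
+ )
+ loss.backward()
+ print(f"toy loss: {loss.item():.3f}")
+ ```
+
+ The real VGT-AE works with an actual pretrained VLM encoder and a learned pixel decoder; see the paper and repository for the exact formulation.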
+
+ More details about the core problem and our solution can be found in the [GitHub repository](https://github.com/hustvl/VGT) and the [paper](https://huggingface.co/papers/2511.23469).
+
+ ---
+
+ ## 📊 Results
+
+ ### Text-to-Image Generation Performance
+
+ <div align="center">
+ <img src="https://github.com/hustvl/VGT/raw/main/asserts/generation_results.png" alt="generation_results">
+ </div>
+
+ **Highlights**:
+ - 🏆 **SOTA among autoregressive models** with GenEval **0.83** and DPG-Bench **81.28**
+ - 🎯 **Extreme data efficiency**: Trained on only **<25M samples** (vs. 198M-2048M+ for competitors)
+ - ⚡ **Competitive with diffusion models** despite using far less data and compute
+
+ ### Image Reconstruction Quality
+
+ Evaluated on ImageNet 50K validation set (256×256):
+
+ | Method | Type | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
+ |:-------|:-----|:-----:|:------:|:------:|:------:|
+ | **Generative-Only Tokenizers** | | | | | |
+ | VQGAN | VQ | 16× | 4.98 | 20.00 | 0.629 |
+ | LlamaGen | VQ | 16× | 2.19 | 20.79 | 0.675 |
+ | SD-VAE | VAE | 16× | 2.64 | 22.13 | 0.590 |
+ | VAR | VAE | 16× | 1.00 | 22.63 | 0.755 |
+ | Open-MAGVIT2 | VQ | 16× | 1.67 | 22.70 | 0.640 |
+ | RAE | VAE | 16× | 0.49 | 19.23 | 0.620 |
+ | DC-AE | VAE | 32× | 0.69 | 23.85 | 0.660 |
+ | **CLIP-based Tokenizers** | | | | | |
+ | VILA-U | CLIP | 16× | 1.80 | - | - |
+ | TokenFlow | CLIP | 16× | 1.37 | 21.41 | 0.687 |
+ | DualViTok | CLIP | 16× | 1.37 | 22.53 | 0.741 |
+ | UniLIP | CLIP | 32× | 0.79 | 22.99 | 0.747 |
+ | **VGT (Ours)** | | | | | |
+ | VGT-AE (Qwen2.5VL) | VLM | 28× | 1.93 | 20.12 | 0.677 |
+ | VGT-AE (InternVL3) | VLM | **28×** | **0.50** | **26.67** | **0.863** |
+
+ **Highlights**:
+ - 🏆 **Best reconstruction quality**: 26.67 PSNR and 0.50 rFID at 28× compression
+ - 📊 **Outperforms specialized VAEs**: Superior to SD-VAE, VAR, and other tokenizers
+ - 🎯 **Higher compression ratio**: Achieves 28× compression vs. 16× for most methods
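+
+ As a reference for the metrics above, PSNR and SSIM for a reconstruction can be computed with scikit-image as sketched below; this is a generic snippet, not the exact evaluation script behind the table. rFID additionally requires an FID implementation over Inception features and is not shown.
+
+ ```python
+ # Generic reconstruction metrics; not the evaluation script used for the reported numbers.
+ import numpy as np
+ from skimage.metrics import peak_signal_noise_ratio, structural_similarity
+
+
+ def reconstruction_metrics(original, reconstructed):
+     """Compute PSNR and SSIM for a pair of uint8 RGB images of shape (H, W, 3)."""
+     psnr = peak_signal_noise_ratio(original, reconstructed, data_range=255)
+     # channel_axis=-1 treats the last dimension as the color channels
+     ssim = structural_similarity(original, reconstructed, data_range=255, channel_axis=-1)
+     return psnr, ssim
+
+
+ # Toy example with a randomly perturbed image; in practice, load a 256x256
+ # ImageNet validation image and its VGT-AE reconstruction instead.
+ rng = np.random.default_rng(0)
+ img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
+ rec = np.clip(img.astype(np.int16) + rng.integers(-8, 9, size=img.shape), 0, 255).astype(np.uint8)
+ psnr, ssim = reconstruction_metrics(img, rec)
+ print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
+ ```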
+
+ ---
+
+ ## 🚀 Getting Started
+
+ ### Installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/hustvl/VGT.git
+ cd VGT
+
+ # Install dependencies
+ conda create -n vgt python=3.10
+ conda activate vgt
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
+ pip install mmengine xtuner tqdm timm
+ pip install diffusers transformers==4.57.1
+ pip install flash-attn --no-build-isolation
+ ```
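+
+ Optionally, a quick sanity check (not part of the repository) can confirm that the CUDA build of PyTorch and the key dependencies resolved correctly:
+
+ ```python
+ # Optional environment sanity check; not part of the VGT repository.
+ import importlib
+
+ import torch
+
+ print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
+ if torch.cuda.is_available():
+     print("GPU:", torch.cuda.get_device_name(0))
+
+ for pkg in ("torchvision", "diffusers", "transformers", "xtuner", "mmengine", "flash_attn"):
+     try:
+         mod = importlib.import_module(pkg)
+         print(f"{pkg}: {getattr(mod, '__version__', 'ok')}")
+     except ImportError as err:
+         print(f"{pkg}: MISSING ({err})")
+ ```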
+
+ ### Pretrained Models
+
+ We provide VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px):
+
+ | Model | Base Model | GenEval | DPG-Bench | Download |
+ |:------|:-----------|:-------:|:---------:|:--------:|
+ | VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
+ | VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
+ | VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
+ | VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [🤗 HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |
+
+ ### Inference
+
+ Download the SFT model checkpoints:
+
+ ```bash
+ cd VGT
+ mkdir ckpts
+ hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
+ hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft
+ ```
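+
+ If you prefer to download checkpoints from Python rather than the `hf` CLI, `huggingface_hub.snapshot_download` produces the same layout (the target directories mirror the commands above):
+
+ ```python
+ # Python alternative to the `hf download` commands above.
+ from huggingface_hub import snapshot_download
+
+ for repo_id in ("hustvl/vgt_qwen25vl_2B_sft", "hustvl/vgt_internvl3_1_6B_sft"):
+     local_dir = snapshot_download(repo_id=repo_id, local_dir=f"ckpts/{repo_id}")
+     print("downloaded to", local_dir)
+ ```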
+
+ Generate images from text prompts:
+
+ ```bash
+ export PYTHONPATH=./:$PYTHONPATH
+
+ # Generate with the InternVL3-based model
+ python scripts/sample_text_list_vgt_intervl3_0.6B.py
+ ```
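+
+ Assuming the sampling script saves generated PNGs to an output folder, a small helper like the one below (not part of the repository; the `outputs/` path is a placeholder) can tile them into a contact sheet for quick side-by-side inspection:
+
+ ```python
+ # Hypothetical helper: tile generated PNGs into a single contact sheet.
+ # Adjust `out_dir` to wherever the sampling script saves its images.
+ from pathlib import Path
+
+ from PIL import Image
+
+ out_dir = Path("outputs")                   # assumed output location
+ paths = sorted(out_dir.glob("*.png"))[:16]  # inspect up to 16 samples
+ if paths:
+     tiles = [Image.open(p).convert("RGB").resize((256, 256)) for p in paths]
+     cols = 4
+     rows = (len(tiles) + cols - 1) // cols
+     sheet = Image.new("RGB", (cols * 256, rows * 256), "white")
+     for i, tile in enumerate(tiles):
+         sheet.paste(tile, ((i % cols) * 256, (i // cols) * 256))
+     sheet.save("contact_sheet.png")
+     print(f"Saved contact_sheet.png with {len(tiles)} images")
+ else:
+     print(f"No PNG files found in {out_dir.resolve()}")
+ ```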
+
+ > Note: Under the same training recipe, we found that VGT-Qwen2.5-VL-2B is stronger at face generation, while VGT-InternVL3-1.6B is stronger at landscapes, lighting and shadow, and animals. Feel free to explore both.
+
+ ---
+
+ ## 🙏 Acknowledgements
+
+ We gratefully acknowledge the following open-source projects: [XTuner](https://github.com/InternLM/xtuner), [MMEngine](https://github.com/open-mmlab/mmengine), [BLIP-3o](https://github.com/JiuhaiChen/BLIP3o), [ShareGPT4o](https://sharegpt4o.github.io/), [Echo-4o](https://github.com/yejy53/Nano-banana-150k), [NextStep-1](https://github.com/stepfun-ai/NextStep-1), [OpenUni](https://github.com/wusize/OpenUni), [UniLIP](https://github.com/nnnth/UniLIP), and [TiTok](https://github.com/bytedance/1d-tokenizer).
+
+ ---
+
+ ## 📝 Citation
+
+ If you find our work useful, please cite our paper:
+
+ ```bibtex
+ @misc{guo2025vgt,
+       title={Visual Generation Tuning},
+       author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
+       year={2025},
+       eprint={2511.23469},
+       archivePrefix={arXiv},
+ }
+ ```
+
+ ---
+
+ ## 📧 Contact
+
+ - **Author**: Jiahao Guo ([email protected])
+ - **Project Lead**: Kai Wu ([email protected])
+ - **Corresponding Author**: Xinggang Wang ([email protected])