Improve model card: add pipeline tag, library, paper, code, and usage
This PR significantly enhances the model card for the VGT (Visual Generation Tuning) model. Key improvements include:
* Adding the `pipeline_tag: text-to-image` to improve discoverability on the Hugging Face Hub.
* Specifying `library_name: diffusers` based on the project's installation requirements, enabling the "How to use" widget.
* Including a direct link to the paper: [Visual Generation Tuning](https://huggingface.co/papers/2511.23469).
* Providing a link to the official GitHub repository: [hustvl/VGT](https://github.com/hustvl/VGT).
* Adding a "Getting Started" section with installation instructions, pretrained model details, and an inference command, all sourced directly from the GitHub README to help users quickly get started.
* Incorporating key highlights and an overview of VGT's methodology from the GitHub README.
These updates make the model card more informative, discoverable, and user-friendly.
The previous card contained only the `license: mit` metadata; the updated model card reads as follows:
---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---

<div align="center">
<h2>VGT: Visual Generation Tuning</h2>

<div align="center">
<img src="https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft/resolve/main/asserts/vgt_logo.png" alt="VGT">
</div>

**_Unleashing Visual Generation Capabilities from Any Pretrained VLM_**

**GenEval 0.83 | DPG-Bench 81.28 | 20× Faster Convergence**

[License](https://github.com/hustvl/VGT/blob/main/LICENSE)
[GitHub](https://github.com/hustvl)
[Model: VGT-Qwen2.5-VL-2B-SFT](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft)
[Model: VGT-InternVL3-1.6B-SFT](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft)
[Paper](https://huggingface.co/papers/2511.23469)
[Code](https://github.com/hustvl/VGT)

</div>

<div align="center">
<img src="https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft/resolve/main/asserts/case_show.png" alt="VGT Generated Images">
</div>

---

## Highlights

- **Novel Paradigm**: Transform any pretrained Vision-Language Model into a powerful image generator through efficient visual generation tuning
- **20× Speedup**: Achieve dramatically faster convergence compared to vanilla VAE-based autoregressive models
- **SOTA Performance**: GenEval **0.83** and DPG-Bench **81.28** with minimal training data
- **Extreme Data Efficiency**: Reach GenEval 0.55 in just 10K iterations and 0.60 in 30K iterations
- **Parallel Inference**: The QueryAR mechanism enables 16× parallel decoding while maintaining high-quality generation
- **Superior Reconstruction**: 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs

---

## What is VGT?

**VGT (Visual Generation Tuning)** is a paradigm built around a fundamental question:

> *Can we directly leverage the well-aligned semantic representations in pretrained VLMs to enable visual generation capabilities?*

VGT is designed to unlock the latent visual generation capabilities of any Vision-Language Model (VLM). By performing efficient visual generation tuning on well-pretrained VLMs, VGT significantly reduces alignment costs and accelerates the convergence of autoregressive modeling in continuous space (20× speedup). Specifically, VGT formulates VGT-AE by aligning the semantic encoders of pretrained VLMs with the latent representations of pixel decoders. On image reconstruction, it achieves 26.67 PSNR and 0.50 rFID at a 28× compression ratio, outperforming specialized VAEs; on visual generation, it achieves state-of-the-art results among autoregressive models.

More details about the core problem and our solution can be found in the [GitHub repository](https://github.com/hustvl/VGT) and the [paper](https://huggingface.co/papers/2511.23469).

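To make the VGT-AE formulation above concrete, here is a minimal, illustrative sketch of the idea: a frozen semantic encoder taken from a pretrained VLM produces patch features, a small projector maps them into a compact continuous latent space, and a lightweight pixel decoder learns to reconstruct the image from those latents. All class names, dimensions, and the decoder architecture below are assumptions for illustration only; refer to the GitHub repository for the actual implementation.

```python
# Illustrative sketch of the VGT-AE idea only -- names, dimensions, and the
# decoder architecture are assumptions, not the actual VGT implementation.
import torch
import torch.nn as nn


class ToyVGTAE(nn.Module):
    def __init__(self, vlm_vision_encoder: nn.Module, feat_dim: int = 1024,
                 latent_dim: int = 64, patch_size: int = 16):
        super().__init__()
        # Semantic encoder taken from a pretrained VLM, kept frozen.
        self.encoder = vlm_vision_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Project VLM features into a compact continuous latent space.
        self.to_latent = nn.Linear(feat_dim, latent_dim)
        # Lightweight pixel decoder: one RGB patch per latent token (placeholder).
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, 3 * patch_size * patch_size),
        )

    def forward(self, images: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        feats = self.encoder(images)       # (B, N, feat_dim) patch-level semantic features
        latents = self.to_latent(feats)    # (B, N, latent_dim) latents for continuous-space AR modeling
        patches = self.decoder(latents)    # (B, N, 3 * patch_size * patch_size) reconstructed patches
        return latents, patches
```

Training such an autoencoder with a reconstruction objective is what lets the autoregressive model later operate directly on VLM-aligned continuous latents instead of learning a visual space from scratch.
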
---

## Results

### Text-to-Image Generation Performance

<div align="center">
<img src="https://github.com/hustvl/VGT/raw/main/asserts/generation_results.png" alt="generation_results">
</div>

**Highlights**:
- **SOTA among autoregressive models** with GenEval **0.83** and DPG-Bench **81.28**
- **Extreme data efficiency**: Trained on only **<25M samples** (vs. 198M-2048M+ for competitors)
- **Competitive with diffusion models** despite using far less data and compute

### Image Reconstruction Quality

Evaluated on the ImageNet 50K validation set (256×256):

| Method | Type | Ratio | rFID ↓ | PSNR ↑ | SSIM ↑ |
|:-------|:-----|:-----:|:------:|:------:|:------:|
| **Generative-Only Tokenizers** |
| VQGAN | VQ | 16× | 4.98 | 20.00 | 0.629 |
| LlamaGen | VQ | 16× | 2.19 | 20.79 | 0.675 |
| SD-VAE | VAE | 16× | 2.64 | 22.13 | 0.590 |
| VAR | VAE | 16× | 1.00 | 22.63 | 0.755 |
| Open-MAGVIT2 | VQ | 16× | 1.67 | 22.70 | 0.640 |
| RAE | VAE | 16× | 0.49 | 19.23 | 0.620 |
| DC-AE | VAE | 32× | 0.69 | 23.85 | 0.660 |
| **CLIP-based Tokenizers** |
| VILA-U | CLIP | 16× | 1.80 | - | - |
| TokenFlow | CLIP | 16× | 1.37 | 21.41 | 0.687 |
| DualViTok | CLIP | 16× | 1.37 | 22.53 | 0.741 |
| UniLIP | CLIP | 32× | 0.79 | 22.99 | 0.747 |
| **VGT (Ours)** |
| VGT-AE (Qwen2.5VL) | VLM | 28× | 1.93 | 20.12 | 0.677 |
| VGT-AE (InternVL3) | VLM | **28×** | **0.50** | **26.67** | **0.863** |

**Highlights**:
- **Best reconstruction quality**: 26.67 PSNR and 0.50 rFID at 28× compression
- **Outperforms specialized VAEs**: Superior to SD-VAE, VAR, and other tokenizers
- **Higher compression ratio**: Achieves 28× compression vs. 16× for most methods

---

## Getting Started

### Installation

```bash
# Clone the repository
git clone https://github.com/hustvl/VGT.git
cd VGT

# Install dependencies
conda create -n vgt python=3.10
conda activate vgt
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install mmengine xtuner tqdm timm
pip install diffusers transformers==4.57.1
pip install flash-attn --no-build-isolation
```

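If you want to verify the environment before running anything, a quick check such as the one below (not part of the official instructions; package names follow the install commands above) confirms that the key dependencies import correctly:

```python
# Optional sanity check -- not part of the official VGT setup instructions.
import torch
import transformers
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # the install above pins 4.57.1
print("diffusers:", diffusers.__version__)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not found; install it with `pip install flash-attn --no-build-isolation`")
```
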
### Pretrained Models

We provide VGT-tuned models based on Qwen2.5-VL and InternVL3 (448px):

| Model | Base Model | GenEval | DPG-Bench | Download |
|:------|:-----------|:-------:|:---------:|:--------:|
| VGT-InternVL3-1.6B-Pretrain | InternVL3-1.6B | 0.58 | 73.05 | [HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_pretrain) |
| VGT-InternVL3-1.6B-SFT | InternVL3-1.6B | 0.83 | 76.33 | [HuggingFace](https://huggingface.co/hustvl/vgt_internvl3_1_6B_sft) |
| VGT-Qwen2.5-VL-2B-Pretrain | Qwen2.5-VL-2B | 0.63 | 78.02 | [HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_pretrain) |
| VGT-Qwen2.5-VL-2B-SFT | Qwen2.5-VL-2B | 0.83 | 81.28 | [HuggingFace](https://huggingface.co/hustvl/vgt_qwen25vl_2B_sft) |

### Inference

Download the SFT model checkpoints:

```bash
cd VGT
mkdir ckpts
hf download hustvl/vgt_qwen25vl_2B_sft --repo-type model --local-dir ckpts/hustvl/vgt_qwen25vl_2B_sft
hf download hustvl/vgt_internvl3_1_6B_sft --repo-type model --local-dir ckpts/hustvl/vgt_internvl3_1_6B_sft
```

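If you prefer to fetch the checkpoints from Python rather than the `hf` CLI, `huggingface_hub`'s `snapshot_download` achieves the same result (the local directories below simply mirror the layout used above):

```python
# Programmatic alternative to the `hf download` commands above.
from huggingface_hub import snapshot_download

for repo_id in ("hustvl/vgt_qwen25vl_2B_sft", "hustvl/vgt_internvl3_1_6B_sft"):
    path = snapshot_download(repo_id=repo_id, local_dir=f"ckpts/{repo_id}")
    print(f"Downloaded {repo_id} to {path}")
```
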
Generate images from text prompts:

```bash
export PYTHONPATH=./:$PYTHONPATH

# Generate with InternVL3-1.6B
python scripts/sample_text_list_vgt_intervl3_0.6B.py
```

> Note: Under the same training recipe, we found that VGT-Qwen2.5-VL-2B performs better at face generation, while VGT-InternVL3-1.6B performs better on landscapes, lighting and shadow, and animals. You are encouraged to explore both.

---

## Acknowledgements

We gratefully acknowledge the following open-source projects: [XTuner](https://github.com/InternLM/xtuner), [MMEngine](https://github.com/open-mmlab/mmengine), [BLIP-3o](https://github.com/JiuhaiChen/BLIP3o), [ShareGPT4o](https://sharegpt4o.github.io/), [Echo-4o](https://github.com/yejy53/Nano-banana-150k), [NextStep-1](https://github.com/stepfun-ai/NextStep-1), [OpenUni](https://github.com/wusize/OpenUni), [UniLIP](https://github.com/nnnth/UniLIP), and [TiTok](https://github.com/bytedance/1d-tokenizer).

---

## Citation

If you find our work useful, please cite our paper:

```bibtex
@misc{guo2025vgt,
      title={Visual Generation Tuning},
      author={Jiahao Guo and Sinan Du and Jingfeng Yao and Wenyu Liu and Bo Li and Haoxiang Cao and Kun Gai and Chun Yuan and Kai Wu and Xinggang Wang},
      year={2025},
      eprint={2511.23469},
      archivePrefix={arXiv},
}
```

---

## Contact

- **Author**: Jiahao Guo ([email protected])
- **Project Lead**: Kai Wu ([email protected])
- **Corresponding Author**: Xinggang Wang ([email protected])