---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- object-detection
- multimodal
- REC
- VLM
- zero-shot-object-detection
language:
- zh
- en
---

# VLM-FO1: Qwen2.5-VL-3B-v01

This repository contains the VLM-FO1_Qwen2.5-VL-3B-v01 model, an implementation of the [VLM-FO1](https://github.com/om-ai-lab/VLM-FO1) framework built on the Qwen2.5-VL-3B base model. VLM-FO1 is a plug-and-play framework designed to bridge the gap between the high-level reasoning of Vision-Language Models (VLMs) and the need for fine-grained visual perception.

## Model Details

### Model Description

VLM-FO1 endows pre-trained VLMs with fine-grained perception without compromising their inherent high-level reasoning and general understanding capabilities. It operates as a plug-and-play module that can be integrated with any existing VLM, establishing an effective and flexible paradigm for building the next generation of perception-aware models. VLM-FO1 excels at a wide range of fine-grained perception tasks, including object grounding, region generative understanding, and visual region reasoning.

- 🧩 **Plug-and-Play Modularity:** Our framework is designed as a set of enhancement modules that can be seamlessly integrated with any pre-trained VLM, preserving its original weights and capabilities.
- 🧠 **Hybrid Fine-grained Region Encoder (HFRE):** We introduce a dual vision encoder architecture that fuses semantic-rich features with perception-enhanced features, creating region tokens that capture both high-level meaning and fine-grained spatial detail.
- 🎯 **State-of-the-Art Performance:** VLM-FO1 achieves state-of-the-art results across a diverse suite of fine-grained perception benchmarks.
- ✅ **Preserves General Abilities:** Our two-stage training strategy ensures that fine-grained perception is gained without catastrophic forgetting of the base model's general visual understanding abilities.

### Model Sources

- **Repository:** [https://github.com/om-ai-lab/VLM-FO1](https://github.com/om-ai-lab/VLM-FO1)
- **Paper:** [https://arxiv.org/pdf/2509.25916](https://arxiv.org/pdf/2509.25916)

## Citation

```bibtex
@article{liu2025vlm,
  title={VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs},
  author={Liu, Peng and Shen, Haozhan and Fang, Chunxin and Sun, Zhicheng and Liao, Jiajia and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2509.25916},
  year={2025}
}
```
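
## Usage (Sketch)

The snippet below is a minimal, unverified sketch of how a checkpoint like this one might be loaded with 🤗 Transformers. It assumes the repository ships custom modeling code reachable via `trust_remote_code=True`; the repository id, prompt text, and processor interface shown here are illustrative assumptions, not the official API. Please refer to the inference code in the [VLM-FO1 GitHub repository](https://github.com/om-ai-lab/VLM-FO1) for the supported usage.

```python
# Hypothetical loading sketch -- the exact interface may differ from the
# official inference scripts at https://github.com/om-ai-lab/VLM-FO1.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "omlab/VLM-FO1_Qwen2.5-VL-3B-v01"  # assumed repository id

# trust_remote_code is assumed to pull in the VLM-FO1 modeling code.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("example.jpg")
prompt = "Detect all persons in the image and output their bounding boxes."

# Build model inputs; the real prompt/chat template should follow the repo's examples.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```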