Collections

Discover the best community collections!

Collections including paper arxiv:2503.20215

about 1 hour ago

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

Less is More: Recursive Reasoning with Tiny Networks

Paper • 2510.04871 • Published Oct 6 • 497
When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

Paper • 2510.07499 • Published Oct 8 • 48
Improving Context Fidelity via Native Retrieval-Augmented Reasoning

Paper • 2509.13683 • Published Sep 17 • 8
Multimodal Iterative RAG for Knowledge-Intensive Visual Question Answering

Paper • 2509.00798 • Published Aug 31 • 1

Voice2Voice models

Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 166

Qwen Technical Report

Paper • 2309.16609 • Published Sep 28, 2023 • 37
Qwen2.5-1M Technical Report

Paper • 2501.15383 • Published Jan 26 • 72
Qwen2.5 Technical Report

Paper • 2412.15115 • Published Dec 19, 2024 • 376
Qwen2.5-Coder Technical Report

Paper • 2409.12186 • Published Sep 18, 2024 • 152

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Paper • 2308.01390 • Published Aug 2, 2023 • 33
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 166

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

Paper • 2510.23763 • Published Oct 27 • 53
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Paper • 2510.15870 • Published Oct 17 • 89
Qwen3-Omni Technical Report

Paper • 2509.17765 • Published Sep 22 • 140
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue

Paper • 2510.13747 • Published Oct 15 • 29

Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 166
totally-not-an-llm/EverythingLM-13b-16k

Text Generation • Updated Apr 23, 2024 • 2.23k • 33
llava-hf/llava-v1.6-mistral-7b-hf

Image-Text-to-Text • 8B • Updated May 1 • 350k • 300

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Paper • 2504.12626 • Published Apr 17 • 51
Qwen3 Technical Report

Paper • 2505.09388 • Published May 14 • 317
Qwen-Image Technical Report

Paper • 2508.02324 • Published Aug 4 • 265
DINOv3

Paper • 2508.10104 • Published Aug 13 • 285

Vision Language Models: 2025 Update

This collection includes all the models, datasets and Spaces mentioned in the blog Vision Language Models: 2025 Update

Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30 • 154k • 1.83k
Qwen2.5 Omni 7B Demo

Running • Featured • 364
Generate text and speech responses from text, audio, images, or video input
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 166
openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5 • 104k • 1.27k

Llammy3.2-3B-GUFF

prithivMLmods/Llama-Sentient-3.2-3B-Instruct

Text Generation • Updated Dec 10, 2024 • 28 • 9
bartendr604/Llama.Diffusion.Flix

Updated Apr 12 • 1
FLUX Unlimited

Running • 1.42k
Use the FLUX model as much as you want.
HKUSTAudio/xcodec2

Audio-to-Audio • 0.8B • Updated Feb 23 • 26.9k • 91
