Collections
Discover the best community collections!
Collections including paper arxiv:2311.05437

- Woodpecker: Hallucination Correction for Multimodal Large Language Models
  Paper • 2310.16045 • Published • 17
- HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
  Paper • 2310.14566 • Published • 27
- SILC: Improving Vision Language Pretraining with Self-Distillation
  Paper • 2310.13355 • Published • 9
- Conditional Diffusion Distillation
  Paper • 2310.01407 • Published • 20

- Communicative Agents for Software Development
  Paper • 2307.07924 • Published • 6
- Self-Refine: Iterative Refinement with Self-Feedback
  Paper • 2303.17651 • Published • 2
- ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
  Paper • 2312.10003 • Published • 44
- ReAct: Synergizing Reasoning and Acting in Language Models
  Paper • 2210.03629 • Published • 31

- Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
  Paper • 2402.17177 • Published • 88
- Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
  Paper • 2403.13248 • Published • 78
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models
  Paper • 2409.20551 • Published • 14

- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
  Paper • 2401.01885 • Published • 28
- Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
  Paper • 2401.15687 • Published • 24
- Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
  Paper • 2312.17172 • Published • 30
- MouSi: Poly-Visual-Expert Vision-Language Models
  Paper • 2401.17221 • Published • 9

- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- NExT-Chat: An LMM for Chat, Detection and Segmentation
  Paper • 2311.04498 • Published • 14
- Sam Hq
  🏆 80 • Segment objects in images using text prompts or scribbles
- ShilongLiu/GroundingDINO
  Updated • 160

- Gemini Robotics: Bringing AI into the Physical World
  Paper • 2503.20020 • Published • 29
- Magma: A Foundation Model for Multimodal AI Agents
  Paper • 2502.13130 • Published • 58
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
  Paper • 2410.23218 • Published • 49

- Visual In-Context Prompting
  Paper • 2311.13601 • Published • 18
- Textbooks Are All You Need
  Paper • 2306.11644 • Published • 152
- AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework
  Paper • 2308.08155 • Published • 10
- LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
  Paper • 2303.02927 • Published • 3

- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 20
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 51
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 39
- Aligning Large Multimodal Models with Factually Augmented RLHF
  Paper • 2309.14525 • Published • 31

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 31
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 22
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69