82 results for “topic:large-multimodal-models”
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Open Source Generative Process Automation (i.e., Generative RPA). AI-First Process Automation with Large Language (LLMs), Large Action (LAMs), Large Multimodal (LMMs), and Visual Language (VLMs) Models
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"
A Framework of Small-scale Large Multimodal Models
A collection of resources on applications of multi-modal learning in medical imaging.
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
An open-source implementation for training LLaVA-NeXT.
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Open Platform for Embodied Agents
A curated list of awesome Multimodal studies.
The official evaluation suite and dynamic data release for MixEval.
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
A comprehensive and critical synthesis of the emerging role of GenAI across the full autonomous driving stack
[CVPR 2026] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
[CVPR 2026] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025)
GeoPixel (A Pixel Grounding Large Multimodal Model for Remote Sensing) is developed specifically for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
[ICLR'25] Reconstructive Visual Instruction Tuning
[NeurIPS 2025] OpenOmni: Official implementation of "Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis"
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
[CVPR 2025 🔥] EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues.