52 results for “topic:mllms”
InternRobotics' open platform for building generalized navigation foundation models.
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
[ICLR 2025] This is the official repository of our paper "MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine"
Efficient Multimodal Large Language Models: A Survey
This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" (ACL 2025 Oral).
[AAAI2026] X-SAM: From Segment Anything to Any Segmentation
[CVPR 2026] G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
This repository collects research papers on large Foundation Models for Scenario Generation and Analysis in Autonomous Driving. The repository is continuously updated to track the latest work.
[ACL 2025] The code repository for "Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning" in PyTorch.
[NeurIPS 2025] Official Repo of Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large Language Models
Official repo of paper "SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models". A post-training framework that creates a cost-effective, self-iterative optimization loop.
[CVPR2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Official repository of the paper "A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models"
[NeurIPS25 & ICML25 Workshop on Reliable and Responsible Foundation Models] A Simple Baseline Achieving Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1. Paper at: https://arxiv.org/abs/2503.10635
[CVPR 2025] Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
PDF parsing tool: a vLLM-accelerated implementation of GOT, with MinerU for layout recognition and GOT for table and formula parsing.
Awesome Reasoning in MLLMs: papers and projects about learning to reason with MLLMs, including Chain-of-Thought (CoT), OpenAI o1, and DeepSeek-R1
[CVPR2025] IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
[AAAI 2026] Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
[ICCVW 2025 (Oral)] Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
[CVPR 2026] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
A Survey of Multimodal Retrieval-Augmented Generation
[ICLR 2025] Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate
[NeurIPS'25] Backdoor Cleaning without External Guidance in MLLM Fine-tuning
On Path to Multimodal Generalist: General-Level and General-Bench
This repository provides a hierarchical taxonomy of key papers on computer vision methods, surpassing flat lists with fine-grained subcategories that delineate emerging hotspots.
ICCV 2025 official implementation for Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks