52 results for “topic:mllms”
InternRobotics' open platform for building generalized navigation foundation models.
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
[ICLR 2025] This is the official repository of our paper "MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine"
Efficient Multimodal Large Language Models: A Survey
This is the repo for the paper "OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use" (ACL 2025 Oral).
[AAAI2026] X-SAM: From Segment Anything to Any Segmentation
[CVPR 2026] G2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
This repository collects research papers on large Foundation Models for Scenario Generation and Analysis in Autonomous Driving. The repository is continuously updated to track the latest work.
[ACL 2025] The code repository for "Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning" in PyTorch.
[NeurIPS 2025] Official Repo of Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
🔥[CVPR2025] EventGPT: Event Stream Understanding with Multimodal Large Language Models
Official repo of paper "SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models". A post-training framework that creates a cost-effective, self-iterative optimization loop.
[CVPR2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
Official repository of the paper "A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models"
[NeurIPS25 & ICML25 Workshop on Reliable and Responsible Foundation Models] A Simple Baseline Achieving Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1. Paper at: https://arxiv.org/abs/2503.10635
[CVPR 2025] Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
ACTIVE-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
PDF parsing tool: a vLLM-accelerated implementation of GOT, with MinerU for layout recognition and GOT for table and formula parsing.
Awesome Reasoning in MLLMs: papers and projects about learning to reason with MLLMs, including Chain-of-Thought (CoT), OpenAI o1, and DeepSeek-R1
[CVPR2025] IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
[AAAI 2026] Global Compression Commander: Plug-and-Play Inference Acceleration for High-Resolution Large Vision-Language Models
[ICCVW 2025 (Oral)] Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
[CVPR 2026] Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
A Survey of Multimodal Retrieval-Augmented Generation
[ICLR 2025] Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate
[NeurIPS'25] Backdoor Cleaning without External Guidance in MLLM Fine-tuning
On Path to Multimodal Generalist: General-Level and General-Bench
This repository provides a hierarchical taxonomy of key papers on computer vision methods, surpassing flat lists with fine-grained subcategories that delineate emerging hotspots.
ICCV 2025 official implementation for Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks