82 results for “topic:large-multimodal-models”
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Open Source Generative Process Automation (i.e., Generative RPA). AI-First Process Automation with Large Language (LLMs), Large Action (LAMs), Large Multimodal (LMMs), and Visual Language (VLMs) Models
[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning
[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"
A Framework of Small-scale Large Multimodal Models
A collection of resources on applications of multi-modal learning in medical imaging.
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
An open-source implementation for training LLaVA-NeXT.
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Open Platform for Embodied Agents
A curated list of awesome Multimodal studies.
The official evaluation suite and dynamic data release for MixEval.
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
A comprehensive and critical synthesis of the emerging role of GenAI across the full autonomous driving stack
[CVPR 2026] LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
A curated collection of resources focused on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs). This repository aggregates surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally.
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
[CVPR 2026] OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
🦎 Yo'Chameleon: Your Personalized Chameleon (CVPR 2025)
GeoPixel (A Pixel Grounding Large Multimodal Model for Remote Sensing) is developed specifically for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
[ICLR'25] Reconstructive Visual Instruction Tuning
[NeurIPS 2025] OpenOmni: Official implementation of "Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis"
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
[CVPR 2025 🔥] EarthDial: Turning Multi-Sensory Earth Observations to Interactive Dialogues.