394 results for “topic:multimodal-large-language-models”
✨✨ Latest Advances on Multimodal Large Language Models
Mobile-Agent: The Powerful GUI Agent Family
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 of 60 public benchmarks.
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
Real-time voice-interactive digital human with customizable appearance and voice; supports voice cloning, with conversation latency as low as 3 s.
Awesome Unified Multimodal Models
A family of lightweight multimodal models.
PyTorch implementation of Audio Flamingo: a series of advanced audio-understanding language models.
A Framework for Speech, Language, Audio, and Music Processing with Large Language Models
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
A collection of resources on applications of multi-modal learning in medical imaging.
Fun-ASR is an end-to-end large speech-recognition model launched by Tongyi Lab.
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Large-Scale Visual Representation Model
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
NEO Series: Native Vision-Language Models from First Principles
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-like MLLM on an RTX 3090/4090 with 24 GB.
(Accepted by IJCV) Liquid: Language Models are Scalable and Unified Multi-modal Generators
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos.