394 results for “topic:multimodal-large-language-models”
✨✨ Latest Advances on Multimodal Large Language Models
Mobile-Agent: The Powerful GUI Agent Family
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Seed1.5-VL is a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 of 60 public benchmarks.
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
Real-time voice-interactive digital human with customizable appearance and voice; supports voice cloning, with conversation latency as low as 3 s.
Awesome Unified Multimodal Models
A family of lightweight multimodal models.
PyTorch implementation of Audio Flamingo: a series of advanced audio-understanding language models.
A Framework for Speech, Language, Audio, and Music Processing with Large Language Models
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
A collection of resources on applications of multi-modal learning in medical imaging.
Fun-ASR is an end-to-end large speech-recognition model launched by Tongyi Lab.
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Large-Scale Visual Representation Model
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
NEO Series: Native Vision-Language Models from First Principles
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
Personal project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {sft/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-like MLLM on an RTX 3090/4090 with 24 GB.
(Accepted by IJCV) Liquid: Language Models are Scalable and Unified Multi-modal Generators
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
[NeurIPS 2024] A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, and Editing
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos.