59 results for “topic:multimodal-llm”
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects, and English, achieving a new SOTA on public Mandarin ASR benchmarks while also offering outstanding singing-lyrics recognition.
Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
A SOTA Industrial-Grade All-in-One ASR system with ASR, VAD, LID, and Punc modules. FireRedASR2 supports Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and both speech and singing ASR. FireRedVAD supports speech/singing/music in 100+ languages. FireRedLID supports 100+ languages and 20+ Chinese dialects. FireRedPunc supports Chinese and English.
[ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs
Research Code for Multimodal-Cognition Team in Ant Group
[AAAI 2026 Oral] Official repository for InfiGUI-G1. We introduce Adaptive Exploration Policy Optimization (AEPO) to overcome semantic alignment bottlenecks in GUI agents through efficient, guided exploration.
[AAAI 2026] The Official Implementation for "Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation"
[IROS'25 Oral & NeurIPSw'24] Official implementation of "MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control"
Teaching Vision-Language Models as Progress Estimators across Embodied Scenarios
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]
Paper list on Video LLM hallucination. Stars and contributions are welcome!
A minimal, hackable Vision-Language Model built on Karpathy’s nanochat — add image understanding and multimodal chat for under $200 in compute.
Official implementation of the paper "Efficient Test-Time Scaling for Small Vision-Language Models", which performs test-time scaling via test-time augmentation.
[ACL 2024] Dataset and Code of "ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction"
Official repository of the paper: Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
Medical Report Generation and VQA (Adapting XrayGPT to Any Modality)
Q-HEART: ECG Question Answering via Knowledge-Informed Multimodal LLMs (ECAI 2025)
[NAACL 2025 Findings] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth estimation, 3D reconstruction, and language-driven scene understanding.
Streamlit app to chat with images using Multi-modal LLMs.
Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach"
LLaVA base model for use with Autodistill.
AgentNav: Zero-shot sparsely grounded long-range visual navigation in real-world cities using Multimodal Large Language Models (MLLMs).
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
Kani extension for supporting vision-language models (VLMs). Comes with model-agnostic support for GPT-Vision and LLaVA.
The future of AI is speaking Chilean, cachai?
A GUI agent system built on MaaFramework and multimodal large models that visually understands on-screen content and uses a Planner-Executor-Verifier three-mode architecture to automatically plan and execute tasks.
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
🎨 AI StoryWeaver — Where creativity meets Generative AI. Upload images and watch Google Gemini 2.5 craft and narrate human-like multilingual stories with emotion and voice. A hands-on showcase of multimodal intelligence, storytelling design, and AI-human expression.