59 results for “topic:multimodal-llm”
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects, and English, achieving a new SOTA on public Mandarin ASR benchmarks while also offering outstanding singing-lyrics recognition.
Official implementation of paper "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens"
A SOTA Industrial-Grade All-in-One ASR system with ASR, VAD, LID, and Punc modules. FireRedASR2 supports Chinese (Mandarin, 20+ dialects/accents), English, code-switching, and both speech and singing ASR. FireRedVAD supports speech/singing/music in 100+ languages. FireRedLID supports 100+ languages and 20+ Chinese dialects. FireRedPunc supports Chinese and English.
[ICCV25 Oral] Token Activation Map to Visually Explain Multimodal LLMs
Research Code for Multimodal-Cognition Team in Ant Group
[AAAI 2026 Oral] Official repository for InfiGUI-G1. We introduce Adaptive Exploration Policy Optimization (AEPO) to overcome semantic alignment bottlenecks in GUI agents through efficient, guided exploration.
[AAAI 2026] The Official Implementation for "Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation"
[IROS'25 Oral & NeurIPSw'24] Official implementation of "MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control"
Teaching Vision-Language Models as Progress Estimators across Embodied Scenarios
[ECCV 2024] Official PyTorch Implementation of "How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs"
The code repository for "Wings: Learning Multimodal LLMs without Text-only Forgetting" [NeurIPS 2024]
Paper list on Video LLM hallucination. Stars and contributions are welcome!
A minimal, hackable Vision-Language Model built on Karpathy’s nanochat — add image understanding and multimodal chat for under $200 in compute.
Official implementation of the paper "Efficient Test-Time Scaling for Small Vision-Language Models", which performs test-time scaling via test-time augmentation.
[ACL 2024] Dataset and Code of "ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction"
Official repository of the paper: Can ChatGPT Detect DeepFakes? A Study of Using Multimodal Large Language Models for Media Forensics
Medical Report Generation and VQA (Adapting XrayGPT to Any Modality)
Q-HEART: ECG Question Answering via Knowledge-Informed Multimodal LLMs (ECAI 2025)
[NAACL 2025 Findings] Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth estimation, 3D reconstruction, and language-driven scene understanding.
Streamlit app to chat with images using Multi-modal LLMs.
Official implementation of ICML 2025 paper "Understanding Multimodal LLMs Under Distribution Shifts: An Information-Theoretic Approach"
LLaVA base model for use with Autodistill.
AgentNav: Zero-shot sparsely grounded long-range visual navigation in real-world cities using Multimodal Large Language Models (MLLMs).
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
Kani extension for supporting vision-language models (VLMs). Comes with model-agnostic support for GPT-Vision and LLaVA.
The future of AI is speaking Chilean, cachai?
A GUI agent system built on MaaFramework and multimodal large models that visually understands on-screen content and uses a Planner-Executor-Verifier three-mode architecture to automatically plan and execute tasks.
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
🎨 AI StoryWeaver — Where creativity meets Generative AI. Upload images and watch Google Gemini 2.5 craft and narrate human-like multilingual stories with emotion and voice. A hands-on showcase of multimodal intelligence, storytelling design, and AI-human expression.