101 results for “topic:multi-modality”
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA), built towards GPT-4V-level capabilities and beyond.
✨✨ Latest Advances on Multimodal Large Language Models
🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Algorithms and Publications on 3D Object Tracking
Parsing-free RAG supported by VLMs
Long-RL: Scaling RL to Long Sequences (NeurIPS 2025)
[CVPR 2025] MINIMA: Modality Invariant Image Matching
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
An open-source implementation of Gemini, the Google model said to "eclipse ChatGPT".
[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
[CVPR 2023] Collaborative Diffusion
An open-source implementation for training LLaVA-NeXT.
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
Official repository for VisionZip (CVPR 2025)
Effortless plug-and-play optimizer that cuts model training costs by 50%; a new optimizer 2x faster than Adam on LLMs.
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
An official PyTorch implementation of the CRIS paper
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
[ICCV2019] Robust Multi-Modality Multi-Object Tracking
Unifying Voxel-based Representation with Transformer for 3D Object Detection (NeurIPS 2022)
This repo contains the official code of our work SAM-SLR, which won the CVPR 2021 Challenge on Large Scale Signer Independent Isolated Sign Language Recognition.
Official code for NeurIPS2023 paper: CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection
[ESSD 2025 & IEEE GRSS DFC 2025] Bright: A globally distributed multimodal VHR dataset for all-weather disaster response
[NeurIPS 2025 DB Track] 3EED: Ground Everything Everywhere in 3D
Embed arbitrary modalities (images, audio, documents, etc.) into large language models.
An open-source, cloud-native serving framework for large multi-modal models (LMMs).
[CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs
(NeurIPS 2022 CellSeg Challenge - 1st Winner) Open source code for "MEDIAR: Harmony of Data-Centric and Model-Centric for Multi-Modality Microscopy"