139 results for “topic:multi-modal-learning”
An open source implementation of CLIP.
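The entry above refers to an open-source CLIP implementation; as context, here is a minimal numpy sketch of CLIP-style contrastive scoring over paired image/text embeddings. The function name, embedding shapes, and temperature value are illustrative assumptions, not the repository's actual API:

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired
    image/text embeddings (CLIP-style sketch, not OpenCLIP's API)."""
    # L2-normalize so dot products become cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    labels = np.arange(len(logits))          # matching pairs lie on the diagonal

    def xent(l):
        # Cross-entropy of each row against its diagonal target.
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
txt_emb = rng.normal(size=(4, 8))
loss = clip_style_loss(img_emb, txt_emb)
```

In real CLIP training the temperature is a learned parameter and the embeddings come from separate image and text encoders; the sketch only shows the symmetric loss structure.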
A Chinese version of CLIP that enables Chinese cross-modal retrieval and representation generation.
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
A curated list of Visual Question Answering (VQA, including image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
[CVPR 2024 & NeurIPS 2024] EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey
CVPR 2023-2024 Papers: a collection of research presented at the leading computer vision conference, with code links, tracking the latest developments in computer vision and deep learning.
[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
[ICCV 2023] Implicit Neural Representation for Cooperative Low-light Image Enhancement
[CVPR 2020] Unsupervised Multi-Modal Image Registration via Geometry Preserving Image-to-Image Translation
Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.
The official repository of Achelous and Achelous++
[ICML 2023] Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining
Official PyTorch repository for CG-DETR: "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"
[ICCV 2023] The official code of Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
A detection/segmentation dataset with labels characterized by intricate and flexible expressions. "Described Object Detection: Liberating Object Detection with Flexible Expressions" (NeurIPS 2023).
[CVPR 2025] FLAIR: VLM with Fine-grained Language-informed Image Representations
Official PyTorch Code for Anchor Token Guided Prompt Learning Methods: [ICCV 2025] ATPrompt and [arXiv 2511.21188] AnchorOPT
[CVPR 2024] Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
Welcome to the Awesome Multi-Modal Object Re-Identification Repository! This repository curates and shares the latest methods, datasets, and resources for multi-modal object re-identification, bringing together cutting-edge research, tools, and papers aimed at advancing research and applications in this field.
A python tool to perform deep learning experiments on multimodal remote sensing data.
[NeurIPS 2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
[NeurIPS 2023] A faithful benchmark for vision-language compositionality
An official implementation of Advancing Radiograph Representation Learning with Masked Record Modeling (ICLR'23)
This repository contains code to download data for the preprint "MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning"
PyTorch implementation of the HyperDenseNet deep neural network for multi-modal image segmentation
[ICLR 2025] Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)
[NeurIPS 2024 Spotlight] Code for the paper "Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts"