"topic:multimodal-learning" — Search

ICCV 2023-2025 Papers: Discover cutting-edge research from ICCV 2023-25, the leading computer vision conference. Stay updated on the latest in computer vision and deep learning, with code included. ⭐ support visual intelligence development!

Python96947Updated 3 days ago

3d-graphics3d-reconstructionbiometricscomputer-visiondatasetsdeep-learningexplainable-aiface-recognitiongesture-recognitioniccviccv2023iccv2025image-processingimage-synthesismultimodal-learningpattern-recognitionphotogrammetrypose-estimationtransfer-learningvideo-synthesis

richard-peng-xia/awesome-multimodal-in-medical-imaging

A collection of resources on applications of multi-modal learning in medical imaging.

92580Updated 6 days ago

large-language-modelslarge-multimodal-modelsmedical-imagingmedical-report-generationmultimodal-deep-learningmultimodal-large-language-modelsmultimodal-learningvisual-question-answering

declare-lab/multimodal-deep-learning

This repository contains various models targetting multimodal representation learning, multimodal fusion for downstream tasks such as multimodal sentiment analysis.

OpenEdge ABL907165Updated 1 day ago

multimodal-deep-learningmultimodal-interactionsmultimodal-learningmultimodal-sentiment-analysis

HenryNdubuaku/maths-cs-ai-compendium

Become a cracked AI/ML Research Engineer

JavaScript848120Updated just now

ai-textbookalgorithmsartificial-intelligencecomputer-sciencecomputer-visiondeep-learningjaxlinear-algebramachine-learningmachine-learning-algorithmsmathmathematicsmultimodal-learningnlpprobabilitypythonreinforcement-learningspeech-processingstatistics

HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis

Papers, code and datasets about deep learning and multi-modal learning for video analysis

836174Updated 2 days ago

deep-learningmachine-learningmultimodal-learningpapervideo-analysisvideo-classificationvideo-dataset

mlfoundations/MINT-1T

🍃 MINT-1T: A one trillion token multimodal interleaved dataset.

83018Updated 1 day ago

datasetmultimodal-learningvision-language-model

henghuiding/ReLA

[CVPR 2023 Highlight & IJCV 2026] GRES: Generalized Referring Expression Segmentation

Python68722Updated 3 days ago

cvpr2023multimodal-learningreferring-expression-comprehensionreferring-expression-segmentationreferring-image-segmentationvision-language-transformer

georgian-io/Multimodal-Toolkit

Multimodal model for text and tabular data with HuggingFace transformers as building block for text data

Python61792Updated 3 weeks ago

huggingface-transformersmultimodal-learningnatural-language-processingtabular-datatransformer

pliang279/MultiBench

[NeurIPS 2021] Multiscale Benchmarks for Multimodal Representation Learning

HTML61391Updated 4 days ago

computer-visiondeep-learninghealthcaremachine-learningmultimodal-learningnatural-language-processingrepresentation-learningroboticsspeech-processing

sangminwoo/awesome-vision-and-language

A curated list of awesome vision and language resources (still under construction... stay tuned!)

56045Updated 3 weeks ago

awesomeawesome-listmultimodal-learningvision-and-language

MMMU-Benchmark/MMMU

This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"

Python54849Updated 6 days ago

computer-visiondeep-learningdeep-neural-networksevaluationfoundation-modelslarge-language-modelslarge-multimodal-modelsllmllmsmachine-learningmultimodalmultimodal-deep-learningmultimodal-learningmultimodalitynatural-language-processingquestion-answeringstemvisual-question-answering

henghuiding/MeViS

[ICCV 2023 & TPAMI 2025] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

Python52121Updated 1 week ago

mevis-datasetmose-datasetmultimodal-learningreferring-expression-comprehensionreferring-expression-segmentationreferring-video-object-segmentationvideo-understanding

DmitryRyumin/ICASSP-2023-24-Papers

ICASSP 2023-2024 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023-24 conferences. Explore the latest advancements in acoustics, speech and signal processing. Code included. Star the repository to support the advancement of audio and signal processing!

Python51923Updated 1 day ago

asrdenoisingdomain-adaptationface-recognitiongenerative-modelsicasspicassp2023icassp2024image-generationkeyword-spottinglanguage-modelingmultimodal-learningmusic-generationself-supervised-learningsemantic-segmentationsignal-processingsignal-restorationspeech-recognitionspoken-language-understandingvad

subho406/OmniNet

Official Pytorch implementation of "OmniNet: A unified architecture for multi-modal multi-task learning" | Authors: Subhojeet Pramanik, Priyanka Agrawal, Aman Hussain

Python51359Updated 3 months ago

artificial-intelligencedeep-learningimage-captioningmachine-learningmultimodal-learningmultitask-learningneural-networknlptransformervideo-recognition

microsoft/XPretrain

Multi-modality pre-training

Python51036Updated 2 weeks ago

computer-visionmultimediamultimodal-learningnlppre-training

llm-lab-org/Multimodal-RAG-Survey

A Survey on Multimodal Retrieval-Augmented Generation

48826Updated 1 day ago

multimodal-learningragretrieval-augmented-generation

pykale/pykale

Knowledge-Aware machine LEarning (KALE): accessible machine learning from multiple sources for interdisciplinary research, part of the 🔥PyTorch ecosystem. ⭐ Star to support our work!

Python47972Updated 2 weeks ago

computer-visiondata-sciencedeep-learningdomain-adaptationgraph-analysisknowledge-aware-learningmachine-learningmedical-image-analysismeta-learningmultimodalmultimodal-learningpythonpytorchtransfer-learning

njustkmg/OMML

Multi-Modal learning toolkit based on PaddlePaddle and PyTorch, supporting multiple applications such as multi-modal classification, cross-modal retrieval and image caption.

Python47775Updated 2 months ago

classificationcrossmodal-retrievalimagecaptioningmultimodalmultimodal-learningpaddlepaddlepythonpytorch

UCSC-VLAA/OpenVision

OpenVision (ICCV 2025), OpenVision 2 (CVPR 2026), and OpenVision 3

Python46524Updated 9 hours ago

jaxmultimodal-learningtpuvision-encodervision-foundation-model

Pointcept/GPT4Point

[CVPR'24 Highlight] GPT4Point: A Unified Framework for Point-Language Understanding and Generation.

Python44130Updated 12 hours ago

3d-generationllmmultimodal-learning

HUANGLIZI/LViT

[IEEE Transactions on Medical Imaging/TMI 2023] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

Python37636Updated 1 month ago

medical-image-analysismultimodal-learningpytorchsegmentationvision-language

kyegomez/CM3Leon

An open source implementation of "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", an all-new multi modal AI that uses just a decoder to generate both text and images

Python36516Updated 2 weeks ago

attentionattention-is-all-you-needdalleimagegenerationmultimodalmultimodal-learningmultimodality

HenryHZY/Awesome-Multimodal-LLM

Research Trends in LLM-guided Multimodal Learning.

35616Updated 2 weeks ago

in-context-learninginstruction-tuninglarge-language-modelsllmmultimodalmultimodal-large-language-modelsmultimodal-learningparameter-efficient-learningparameter-efficient-tuning

Page 1 of 17