217 results for “topic:vision-language”
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Chinese version of CLIP, enabling Chinese cross-modal retrieval and representation generation.
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation, and introduces a rigorous quantitative evaluation benchmark for video-based conversational models.
Overview of Japanese LLMs (日本語LLMまとめ)
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
A framework for small-scale large multimodal models
Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Official implementation of SEED-LLaMA (ICLR 2024).
CLIPort: What and Where Pathways for Robotic Manipulation
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Multimodal Chinese LLaMA & Alpaca large language models (VisualCLA)
[IEEE TMI 2023] Official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
METER: A Multimodal End-to-end TransformER Framework
A minimal codebase for fine-tuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc.
[ICCV 2021 & TPAMI 2023] Vision-Language Transformer and Query Generation for Referring Segmentation
[CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Simulating the Real World: survey & resources, containing our survey "Simulating the Real World: A Unified Survey of Multimodal Generative Models" and Awesome-Text2X-Resources. Watch this repository for the latest updates! 🔥
🌐 Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
Tools for movie and video research
[NeurIPS 2023] Code and model for "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset"