217 results for “topic:vision-language”
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Chinese version of CLIP, enabling Chinese cross-modal retrieval and representation generation.
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversations about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation, and introduces a rigorous quantitative evaluation benchmark for video-based conversational models.
Overview of Japanese LLMs (日本語LLMまとめ)
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
A framework for small-scale large multimodal models
Pix2Seq codebase: multi-task learning with generative modeling (autoregressive and diffusion)
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Official implementation of SEED-LLaMA (ICLR 2024).
CLIPort: What and Where Pathways for Robotic Manipulation
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Multimodal Chinese LLaMA & Alpaca large language models (VisualCLA)
[IEEE TMI 2023] Official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
METER: A Multimodal End-to-end TransformER Framework
A minimal codebase for fine-tuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc.
[ICCV 2021 & TPAMI 2023] Vision-Language Transformer and Query Generation for Referring Segmentation
[CVPR 2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Simulating the Real World: survey & resources, containing our survey "Simulating the Real World: A Unified Survey of Multimodal Generative Models" and Awesome-Text2X-Resources. Watch this repository for the latest updates! 🔥
🌐 Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
Tools for movie and video research
[NeurIPS 2023] Code and model for "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset"