55 results for “topic:visual-language-models”
A state-of-the-art open visual language model | Multimodal pre-trained model
🦀️ CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents. https://crab.camel-ai.org/
The official repo of One RL to See Them All: Visual Triple Unified Reinforcement Learning
Commanding robots using only language model prompts
https://arxiv.org/abs/2312.10807
A curated list of Turkish AI models, datasets, papers
Official repository of FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis
Build a simple, basic multimodal large model from scratch 🤖
Implementation of the "Learn No to Say Yes Better" paper.
WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
Official repo for "AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability"
Official Repo for the paper: VCR: Visual Caption Restoration. Check arxiv.org/pdf/2406.06462 for details.
Code implementation for paper titled "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision"
Awesome Memory-VLA: A curated list of Vision-Language-Action models with memory
Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models
This repository contains the data and code of the paper titled "IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models"
Universal Adversarial Perturbations for Vision-Language Pre-trained Models
Code for the paper "Towards Concept-based Interpretability of Skin Lesion Diagnosis using Vision-Language Models", IEEE ISBI 2024 (Oral).
Official implementation of OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping (ACM MM 2025)
[ICCVW 2025] Implementation for DAM-QA: Describe Anything Model for Visual Question Answering on Text-rich Images
This is the official implementation of ViCA2 (Visuospatial Cognitive Assistant 2), a multimodal large language model designed for advanced visuospatial reasoning. The repository also provides training scripts for the original ViCA model.
[NAACL 2024] Official Implementation of paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image-Text Models"
Chain of Images for Intuitively Reasoning
#3 Winner of Best Use of Zoom API at Stanford TreeHacks 2025! An AI-powered meeting assistant that captures video, audio and textual context from Zoom calls using multimodal RAG.
experimental: finetune smolVLM on COCO (without any special <locXYZ> tokens)
A benchmark for evaluating hallucinations in large visual language models
A from-scratch implementation of PaliGemma, built by following a YouTube tutorial to learn and demonstrate application/library/system development, using the approaches and best practices shown in the original guide.
Rust implementation of Google Paligemma with Candle
A Telegram bot for validating audio and video content using CV models, SR models, and VLMs, with deepfake detection leveraging metadata analysis.
Official code repo for the paper "Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective"