Anomaly detection related books, papers, videos, and toolboxes. Last updated in late 2025 with LLM and VLM works!
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM / VLM!
An on-premises, OCR-free toolkit for unstructured data extraction, markdown conversion, and benchmarking. (https://idp-leaderboard.org/)
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.
🎯An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization, MXFP4, NVFP4, GGUF, and adaptive schemes.
Official repository for VisionZip (CVPR 2025)
[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models
Scala client for OpenAI API and other major LLM providers
[CVPR 2024] Official implementation of "ViTamin: Designing Scalable Vision Models in the Vision-language Era"
This repository collects research papers on large foundation models for scenario generation and analysis in autonomous driving. It will be continuously updated to track the latest work.
Open-source tools for training and evaluating Vision Language Models for OCR
Official Repository of OmniCaptioner
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders [Technical Report]
[NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
[CVPR2025] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Hub for researchers exploring VLMs and Multimodal Learning:)
[NeurIPS'24] Official PyTorch Implementation of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Benchmarking Vision-Language Models on OCR tasks in Dynamic Video Environments
Code for our ICCV 2025 paper "CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers."
Implementing scalable LLMs in pure JAX (no third-party libraries)
[ICASSP 2024] The official repo for Harnessing the Power of Large Vision Language Models for Synthetic Image Detection
SurgLaVi: Official repository
A comprehensive guide to navigating the world of generative artificial intelligence!
[COLM 2025] JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
SIBench: a benchmark project on visual spatial reasoning tasks
Convert documents, images to high-quality Markdown using Vision LLMs. Built for RAG ingestion pipelines.
A minimal, hackable Vision-Language Model built on Karpathy’s nanochat — add image understanding and multimodal chat for under $200 in compute.
We introduce VLM-Mamba, the first Vision-Language Model built entirely on State Space Models (SSMs), specifically leveraging the Mamba architecture.
This is the version-control repository for our Introduction to Engineering (ME1221) course project. The project aims to build a CLIP-based smart eldercare camera module with semantic customization, covering hardware support development, backend VLM development, and frontend development. Technically, we use the large-scale semantically pretrained CLIP and FG-CLIP2 models, with ViTs as the vision encoder and Transformers as the text encoder, performing zero-shot scene recognition to enable highly personalized smart features.