76 results for “topic:vision-ai”
Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
AI-powered companion for Homebox. Snap photos and let AI auto-identify and catalog items into your inventory, then use the AI Chat to organize, search, and update your inventory effortlessly.
🐊 Snappy's unique approach unifies vision-language late interaction with structured OCR for region-level knowledge retrieval. Like the project? Drop a star! ⭐
Keep track of what has happened in AI this month. Discover the best AI/LLM resources and news for this month.
📺 Instill Console for 🔮 Instill Core: https://github.com/instill-ai/instill-core
Open-source implementation of the Pomelli project by Google
Gemini Vision & Image Generation MCP for Claude Desktop and Claude Code
AI-powered video understanding: extract key frames from YouTube, Bilibili & any video page, get structured summaries via vision AI (VLM). Supports yt-dlp, Playwright, and common cloud browsers.
[CVPRW'25] Official Code For "SK-RD4AD: Skip-Connected Reverse Distillation for One-Class Anomaly Detection"
This repository demonstrates YOLOv8-based license plate recognition with GCP Vision AI integration, enabling versatile real-world applications like vehicle identification, traffic monitoring, and geospatial analysis while capturing vital media metadata for enhanced insights.
🌀 The world's first emotionally intelligent CLI that thinks, creates, and empathizes with developers. Autonomous AI with Vision, Dream Engine, and Emotional Intelligence.
Bidirectional Markdown↔PDF converter with AI-powered vision. MD→PDF with beautiful themes, PDF→MD with LLaVA - open source & privacy-first
AI-Powered Kahoot Auto-Answer Chrome Extension — supports every question type
MDDenseResNet: Enhanced Malware Detection Using DNNs
MCQ_Grading_Bot is an AI-powered tool that grades solved MCQ exam sheets from images using Gemini Vision. It extracts student info, checks answers, calculates score, and displays detailed results—all through a simple Gradio interface in Colab.
Vision-powered UX simulation engine for Claude Code. Renders pages in a real browser, captures 36+ screenshots across viewports, clicks through interactive elements, maps CTA funnels, tests signup flows, and scores across 7 UX dimensions. Replaces manual user testing with automated multi-viewport analysis.
AI-powered health platform with multi-LLM engine (GPT-4o, Claude, Gemini). Workout generation, medication tracking with OCR, vision AI, gamification with leaderboards/rewards. Self-hosted, privacy-first.
Hybrid AI orchestration stack combining local LLMs (Ollama), vector search (Qdrant), and Azure AI Foundry for scalable RAG, Agentic AI, and Vision. Built with .NET 8 and Python.
General vision AI defect detection engine for MLOps processes and simulations
Vision Agent Analyst is a professional web application for automatic analysis of visual data (diagrams, interfaces, documents) using multimodal artificial intelligence models.
Spring Boot + Gemini AI integration using Ollama Cloud with support for text and image chat APIs.
Backend for the Pinterest project by OND team
People detection and notifications based on the Raspberry Pi + AI Camera
qwen3-vl-2b-instruct performing step-by-step tasks, confirming the use of normalized coordinates and tool execution
Eagle-Eye-AI is a project designed for the Kria KR260 board that enables AI-driven camera tracking and face detection.
Model-Agnostic Task Architecture — a task-centric computer vision framework.
AI-powered pipeline that converts YouTube videos into polished articles using vision-based transcription - captures code, terminal output, and on-screen text that subtitles miss
Multi-Agent Vision-Driven Automation Showcase: CUA + Playwright + LangChain
Multi-modal AI agent that extracts information from PDFs, images, and documents to answer questions. Combines vision models with RAG architecture for intelligent document understanding. Upload any file and chat with your documents. Built with LangChain, vision APIs, and vector embeddings.
Vision AI "Cortex" for Agents. A Playwright-based MCP Server & API that captures screenshots with ground-truth DOM extraction and full auth state injection. Containerized.