282 results for “topic:multimodal-ai”
🚀 Truly open-source AI avatar (digital human) toolkit for offline video generation and digital human cloning.
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi.ai.
Resources, examples & tutorials for multimodal AI, RAG, and agents using vector search and LLMs
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
EVA OS — A real-time multimodal AIOS for next-generation hardware, making your devices "alive" and as intelligent as a real brain.
This GitHub repository contains the complete code for building Business-Ready Generative AI Systems (GenAISys) from scratch. It guides you through architecting and implementing advanced AI controllers, intelligent agents, and dynamic RAG frameworks. The projects demonstrate practical applications across various domains.
🐊 Snappy's unique approach unifies vision-language late interaction with structured OCR for region-level knowledge retrieval. Like the project? Drop a star! ⭐
On-device AI for iOS & Android
Hub for researchers exploring VLMs and Multimodal Learning :)
🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync
Reference-first AI image editing desktop for developers (macOS, Tauri, Rust).
A web app that dynamically generates playable 'Spot the Difference' games from a single text prompt using a multimodal pipeline with Google's Gemini and Imagen models.
AI-powered tool to turn long videos into short, viral-ready clips. Combines transcription, speaker diarization, scene detection & 9:16 resizing — perfect for creators & smart automation.
[Nature Machine Intelligence] ImmunoStruct enables multimodal deep learning for immunogenicity prediction
ICML 2025 Papers: Dive into cutting-edge research from the premier machine learning conference. Stay current with breakthroughs in deep learning, generative AI, optimization, reinforcement learning, and beyond. Code implementations included. ⭐ support the future of machine learning research!
Server-side video workflows for agents: ingest, understand, search, edit, stream.
Neocortex Unity SDK for Smart NPCs and Virtual Assistants
Learn how multimodal AI merges text, image, and audio for smarter models
Open-source, AI-enhanced CAT tool with multi-LLM support, translation memory, glossary management, Superbench translation quality benchmarking, ‘Superlookup’ concordance across TMs/glossaries/web resources, voice commands, and seamless integration with leading CAT tools. Experimental Okapi Framework sidecar for industrial-strength file extraction.
Enterprise-ready solution leveraging multimodal Generative AI (Gen AI) to enhance existing or new applications beyond text—implementing RAG, image classification, video analysis, and advanced image embeddings.
Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language. Production-ready with multi-AI support, vector storage, and multi-database coordination.
A demo multimodal AI chat application built with Streamlit and Google's Gemini model. Features include: secure Google OAuth, persistent data storage with Cloud SQL (PostgreSQL), and intelligent function calling. Includes a persona-based newsletter engine to deliver personalized insights.
A fully autonomous automation system for Windows that runs tasks without any user interaction. It executes scheduled or trigger-based workflows using Python, system tools, and smart agents — ideal for repetitive tasks, bots, or self-executing pipelines.
Vision Foundation Models: SAM, ViT, CLIP, DINOv2, object detection, segmentation, and multimodal AI for computer vision.
The public repository for "LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval", published at EMNLP 2025 (Main).
🚀 First multimodal AI-powered visual testing plugin for Claude Code. AI that can SEE your UI! 10x faster frontend development with closed-loop testing, browser automation, and Claude 4.5 Sonnet vision.
AI Framework for Remote Sensing Image Analysis using RAG - 88%+ accuracy, multi-modal queries, ChatGPT-like interface
A multimodal live AI assistant designed to enhance the browsing experience using Gemini.
Open-source Android framework for low-latency, LLM-driven multimodal interaction on Pepper. Uses end-to-end speech-to-speech models and extensive Function Calling for agentic robot control (navigation, gaze, vision, touch). Also runs on regular Android devices.
A persistent narrative universe generator that creates illustrated story chapters using AI. It combines text generation (via multiple providers: OpenAI, Groq, Together AI, HuggingFace, OpenRouter) with image generation (via Replicate, HuggingFace, Pollinations, Fal.ai) to produce coherent, ongoing narratives with scene illustrations.