282 results for “topic:multimodal-ai”
🚀 Truly open-source AI avatar (digital human) toolkit for offline video generation and digital human cloning.
Multi-modal Generative Media Skills for AI Agents (Claude Code, Cursor, Gemini CLI). High-quality image, video, and audio generation powered by muapi.ai.
Resources, examples & tutorials for multimodal AI, RAG, and agents using vector search and LLMs
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
EVA OS — A real-time multimodal AIOS for next-generation hardware, making your devices "alive" and as intelligent as a real brain.
This GitHub repository contains the complete code for building Business-Ready Generative AI Systems (GenAISys) from scratch. It guides you through architecting and implementing advanced AI controllers, intelligent agents, and dynamic RAG frameworks. The projects demonstrate practical applications across various domains.
🐊 Snappy's unique approach unifies vision-language late interaction with structured OCR for region-level knowledge retrieval. Like the project? Drop a star! ⭐
On-device AI for iOS & Android
Hub for researchers exploring VLMs and Multimodal Learning :)
🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync
Reference-first AI image editing desktop for developers (macOS, Tauri, Rust).
A web app that dynamically generates playable 'Spot the Difference' games from a single text prompt using a multimodal pipeline with Google's Gemini and Imagen models.
AI-powered tool to turn long videos into short, viral-ready clips. Combines transcription, speaker diarization, scene detection & 9:16 resizing — perfect for creators & smart automation.
[Nature Machine Intelligence] ImmunoStruct enables multimodal deep learning for immunogenicity prediction
ICML 2025 Papers: Dive into cutting-edge research from the premier machine learning conference. Stay current with breakthroughs in deep learning, generative AI, optimization, reinforcement learning, and beyond. Code implementations included. ⭐ support the future of machine learning research!
Server-side video workflows for agents: ingest, understand, search, edit, stream.
Neocortex Unity SDK for Smart NPCs and Virtual Assistants
Learn how multimodal AI merges text, image, and audio for smarter models
Open-source, AI-enhanced CAT tool with multi-LLM support, translation memory, glossary management, Superbench translation quality benchmarking, ‘Superlookup’ concordance across TMs/glossaries/web resources, voice commands, and seamless integration with leading CAT tools. Experimental Okapi Framework sidecar for industrial-strength file extraction.
Enterprise-ready solution leveraging multimodal Generative AI (Gen AI) to enhance existing or new applications beyond text—implementing RAG, image classification, video analysis, and advanced image embeddings.
Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language. Production-ready with multi-AI support, vector storage, and multi-database coordination.
A demo multimodal AI chat application built with Streamlit and Google's Gemini model. Features include: secure Google OAuth, persistent data storage with Cloud SQL (PostgreSQL), and intelligent function calling. Includes a persona-based newsletter engine to deliver personalized insights.
A fully autonomous automation system for Windows that runs tasks without any user interaction. It executes scheduled or trigger-based workflows using Python, system tools, and smart agents — ideal for repetitive tasks, bots, or self-executing pipelines.
Vision Foundation Models: SAM, ViT, CLIP, DINOv2, object detection, segmentation, and multimodal AI for computer vision.
The public repository for "LILaC: Late Interacting in Layered Component Graph for Open-domain Multimodal Multihop Retrieval", published at EMNLP 2025 (Main).
🚀 First multimodal AI-powered visual testing plugin for Claude Code. AI that can SEE your UI! 10x faster frontend development with closed-loop testing, browser automation, and Claude 4.5 Sonnet vision.
AI Framework for Remote Sensing Image Analysis using RAG - 88%+ accuracy, multi-modal queries, ChatGPT-like interface
A multimodal live AI assistant designed to enhance the browsing experience using Gemini.
Open-source Android framework for low-latency, LLM-driven multimodal interaction on Pepper. Uses end-to-end speech-to-speech models and extensive Function Calling for agentic robot control (navigation, gaze, vision, touch). Also runs on regular Android devices.
A persistent narrative universe generator that creates illustrated story chapters using AI. It combines text generation (via multiple providers: OpenAI, Groq, Together AI, HuggingFace, OpenRouter) with image generation (via Replicate, HuggingFace, Pollinations, Fal.ai) to produce coherent, ongoing narratives with scene illustrations.