281 results for “topic:video-understanding”
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
A curated list of action recognition and related area resources
One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks
[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
[ICCV 2019] TSM: Temporal Shift Module for Efficient Video Understanding
An open-source toolbox for action understanding based on PyTorch
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Awesome video understanding toolkits based on PaddlePaddle. It supports video data annotation tools, lightweight RGB- and skeleton-based action recognition models, and practical applications for video tagging and sports action detection.
Code & Models for Temporal Segment Networks (TSN) in ECCV 2016
SALMONN family: A suite of advanced multi-modal LLMs
awesome grounding: A curated list of research papers in visual grounding
Temporal Segment Networks (TSN) in PyTorch
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
🔥🔥🔥 A paper list of some recent Computer Vision (CV) works
[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Temporal action detection with SSN
Official code for the Goldfish model for long video understanding and MiniGPT4-video for short video understanding
OpenAI and Anthropic compatible server for Apple Silicon. Run LLMs and vision-language models (Llama, Qwen-VL, LLaVA) with continuous batching, MCP tool calling, and multimodal support. Native MLX backend, 400+ tok/s. Works with Claude Code.
[ICCV 2023 & TPAMI 2025] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
VideoAgent: All-in-One Agentic Framework for Video Understanding, Editing, and Remaking
A collection of recent video understanding datasets, under construction!
Temporal Segments LSTM and Temporal-Inception for Activity Recognition
✨✨[NeurIPS 2025] This is the official implementation of our paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension"
[CVPR 2021] TDN: Temporal Difference Networks for Efficient Action Recognition
[CVPRW'24] SoccerNet Game State Reconstruction: End-to-End Athlete Tracking and Identification on a Minimap (CVPR24 - CVSports workshop)
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
Deep Video Discovery (DVD) is a deep-research-style question-answering agent designed for understanding extra-long videos.
[CVPR 2024] MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding