Awesome Visual Spatial Reasoning
Yuhan Wu4, Rundi Cui4, Binghao Ran4, Zaibin Zhang4, Zhedong Zheng3, Zhipeng Zhang1,
Yifan Wang4, Lin Song2, Lijun Wang4, Yanwei Li✉️5, Ying Shan2, Huchuan Lu4
1SJTU, 2ARC Lab, Tencent PCG, 3UM, 4DLUT, 5CUHK * Equal Contributions 📌 Project Lead ✉️ Corresponding Author
🤗 Dataset | 🏆 Leaderboard | 📖 Survey | 🎯 Code | 📄 arXiv
News and Updates
- 🎉🎉🎉 25.9.23 - Released a preprint of our survey on visual spatial reasoning tasks.
- 🎯🎯🎯 25.9.23 - Released comprehensive evaluation results for mainstream models on visual spatial reasoning.
- 🎉🎉🎉 25.9.15 - Open-sourced the evaluation data for visual spatial reasoning tasks.
- 🤩🥳🤗 25.9.15 - Open-sourced the evaluation toolkit.
- ✍️🦾 25.6.28 - Compiled the "Datasets" section.
- 🎉🏃‍♂️🏃‍♀️ 25.6.16 - The "Awesome Visual Spatial Reasoning" project is now live!
- 🔍💻 25.6.12 - Surveyed the field and collected 100 relevant works.
- 🏃‍♀️🏃‍♂️🎉 25.6.10 - Launched a survey project on visual spatial reasoning.
Open-Source Evaluation Toolkit
Evaluation of SOTA models on 23 visual spatial reasoning tasks.
- `git clone https://github.com/song2yu/SIBench-VSR.git`
- Refer to the repository's README.md for more details; a hypothetical sketch of an evaluation loop is shown below.
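For orientation, here is a minimal, hypothetical sketch of what an evaluation loop over spatial-reasoning QA items might look like. The `load_benchmark` helper, the item fields, and `model.answer` are illustrative assumptions, not the actual SIBench-VSR interface; see the repository's README.md for real usage.

```python
# Minimal, hypothetical sketch of a VSR evaluation loop.
# `load_benchmark`, the item fields, and `model.answer` are
# illustrative placeholders, NOT the actual SIBench-VSR API.
import json


def load_benchmark(path: str) -> list[dict]:
    """Load QA items from a JSON file (one dict per item)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def evaluate(model, items: list[dict]) -> float:
    """Return multiple-choice accuracy of `model` over `items`."""
    correct = 0
    for item in items:
        # Each item is assumed to carry an image path, a question,
        # candidate answers, and the index of the ground-truth choice.
        pred = model.answer(item["image"], item["question"], item["choices"])
        correct += int(pred == item["answer"])
    return correct / len(items)
```

Usage would then be along the lines of `evaluate(my_model, load_benchmark("items.json"))`.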
Contributing
We welcome contributions to this repository! If you would like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch with your changes.
- Submit a pull request with a clear description of your changes.
You can also open an issue if you have suggestions or comments.
Please feel free to contact us (SongsongYu203@163.com).
Overview
The research community is increasingly focused on the visual spatial reasoning (VSR) abilities of Vision-Language Models (VLMs). Yet, the field lacks a clear overview of its evolution and a standardized benchmark for evaluation. Current assessment methods are disparate and lack a common toolkit. This project aims to fill that void. We are developing a unified, comprehensive, and diverse evaluation toolkit, along with an accompanying survey paper. We are actively seeking collaboration and discussion with fellow experts to advance this initiative.
Task Explanation
Visual spatial understanding is a key task at the intersection of computer vision and cognitive science. It aims to enable intelligent agents (such as robots and AI systems) to parse spatial relationships in the environment from visual inputs (images, videos, etc.), forming an abstract cognition of the physical world. In embodied intelligence, it is the foundation of the "perception-decision-action" loop: only by understanding attributes such as object positions, distances, sizes, and orientations can an agent navigate environments, manipulate objects, or interact with humans. A hypothetical example of such a task item is sketched below.
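To make this concrete, a single-image spatial QA item could look like the following. This is a hypothetical example for illustration only; the actual data format is defined by the released benchmark.

```python
# Hypothetical single-image spatial-reasoning QA item, shown only to
# illustrate the kind of question VSR targets; not the released format.
example_item = {
    "image": "scenes/kitchen_042.jpg",  # placeholder image path
    "question": "From the camera's viewpoint, is the mug to the left of the laptop?",
    "choices": ["yes", "no"],
    "answer": 0,  # index into `choices` (here: "yes")
}
```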
Timeline
Citation
If you find this project useful, please consider citing:
@article{sibench2025,
  title={How Far are VLMs from True Visual Spatial Intelligence? A Benchmark-Driven Perspective},
  author={Songsong Yu and Yuxin Chen and Hao Ju and Lianjie Jia and Fuxi Zhang and Shaofei Huang and Yuhan Wu and Rundi Cui and Binghao Ran and Zaibin Zhang and Zhedong Zheng and Zhipeng Zhang and Yifan Wang and Lin Song and Lijun Wang and Yanwei Li and Ying Shan and Huchuan Lu},
  journal={arXiv preprint arXiv:2509.18905},
  year={2025}
}
Table of Contents
To help the community quickly grasp visual spatial reasoning, we first categorize works by input modality into Single Image, Monocular Video, and Multi-View Images. We also survey other input modalities, such as point clouds, and specific applications, such as embodied robotics; these are temporarily grouped under "Others" and will be organized in more detail in the future.