Awesome Visual Spatial Reasoning
Yuhan Wu4, Rundi Cui4, Binghao Ran4, Zaibin Zhang4, Zhedong Zheng3, Zhipeng Zhang1,
Yifan Wang4, Lin Song2, Lijun Wang4, Yanwei Li✉️5, Ying Shan2, Huchuan Lu4
1SJTU, 2ARC Lab, Tencent PCG, 3UM, 4DLUT, 5CUHK * Equal Contributions 📌 Project Lead ✉️ Corresponding Author
🤗 Dataset | 🏆 Leaderboard | 📖 Survey | 🎯 Code | 📄 arXiv
News and Updates
- 🎉🎉🎉 25.9.23 - Released a preprint of our survey on visual spatial reasoning tasks.
- 🎯🎯🎯 25.9.23 - Released comprehensive evaluation results for mainstream models on visual spatial reasoning.
- 🎉🎉🎉 25.9.15 - Open-sourced the evaluation data for visual spatial reasoning tasks.
- 🤩🥳🤗 25.9.15 - Open-sourced the evaluation toolkit.
- ✍️🦾 25.6.28 - Compiled the "Datasets" section.
- 🎉🏃‍♂️🏃‍♀️ 25.6.16 - The "Awesome Visual Spatial Reasoning" project is now live!
- 🔍💻 25.6.12 - Surveyed the field and collected 100 relevant works.
- 🏃‍♀️🏃‍♂️🎉 25.6.10 - Launched a survey project on visual spatial reasoning.
Open-Source Evaluation Toolkit
Evaluation of SOTA models on 23 visual spatial reasoning tasks.
- `git clone https://github.com/song2yu/SIBench-VSR.git`
- Refer to the repository's README.md for more details; a hypothetical sketch of an evaluation loop is shown below.
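For orientation, here is a minimal, hypothetical sketch of what an evaluation loop over spatial-reasoning QA items might look like. The `load_benchmark` helper, the item fields, and `model.answer` are illustrative assumptions, not the actual SIBench-VSR interface; see the repository's README.md for real usage.

```python
# Minimal, hypothetical sketch of a VSR evaluation loop.
# `load_benchmark`, the item fields, and `model.answer` are
# illustrative placeholders, NOT the actual SIBench-VSR API.
import json


def load_benchmark(path: str) -> list[dict]:
    """Load QA items from a JSON file (one dict per item)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def evaluate(model, items: list[dict]) -> float:
    """Return multiple-choice accuracy of `model` over `items`."""
    correct = 0
    for item in items:
        # Each item is assumed to carry an image path, a question,
        # candidate answers, and the index of the ground-truth choice.
        pred = model.answer(item["image"], item["question"], item["choices"])
        correct += int(pred == item["answer"])
    return correct / len(items)
```

Usage would then be along the lines of `evaluate(my_model, load_benchmark("items.json"))`.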
Contributing
We welcome contributions to this repository! If you would like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch with your changes.
- Submit a pull request with a clear description of your changes.
You can also open an issue if you have suggestions or comments.
Please feel free to contact us (SongsongYu203@163.com).
Overview
The research community is increasingly focused on the visual spatial reasoning (VSR) abilities of Vision-Language Models (VLMs). Yet, the field lacks a clear overview of its evolution and a standardized benchmark for evaluation. Current assessment methods are disparate and lack a common toolkit. This project aims to fill that void. We are developing a unified, comprehensive, and diverse evaluation toolkit, along with an accompanying survey paper. We are actively seeking collaboration and discussion with fellow experts to advance this initiative.
Task Explanation
Visual spatial understanding is a key task at the intersection of computer vision and cognitive science. It aims to enable intelligent agents (such as robots and AI systems) to parse spatial relationships in the environment from visual inputs (images, videos, etc.), forming an abstract cognition of the physical world. In embodied intelligence, it is the foundation of the "perception-decision-action" loop: only by understanding attributes such as object positions, distances, sizes, and orientations can an agent navigate environments, manipulate objects, or interact with humans. A hypothetical example of such a task item is sketched below.
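To make this concrete, a single-image spatial QA item could look like the following. This is a hypothetical example for illustration only; the actual data format is defined by the released benchmark.

```python
# Hypothetical single-image spatial-reasoning QA item, shown only to
# illustrate the kind of question VSR targets; not the released format.
example_item = {
    "image": "scenes/kitchen_042.jpg",  # placeholder image path
    "question": "From the camera's viewpoint, is the mug to the left of the laptop?",
    "choices": ["yes", "no"],
    "answer": 0,  # index into `choices` (here: "yes")
}
```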
Timeline
Citation
If you find this project useful, please consider citing:
@article{sibench2025,
  title={How Far are VLMs from True Visual Spatial Intelligence? A Benchmark-Driven Perspective},
  author={Songsong Yu and Yuxin Chen and Hao Ju and Lianjie Jia and Fuxi Zhang and Shaofei Huang and Yuhan Wu and Rundi Cui and Binghao Ran and Zaibin Zhang and Zhedong Zheng and Zhipeng Zhang and Yifan Wang and Lin Song and Lijun Wang and Yanwei Li and Ying Shan and Huchuan Lu},
  journal={arXiv preprint arXiv:2509.18905},
  year={2025}
}
Table of Contents
To help the community quickly grasp visual spatial reasoning, we first categorize works by input modality into Single Image, Monocular Video, and Multi-View Images. We also survey other input modalities, such as point clouds, and specific applications, such as embodied robotics; these are temporarily grouped under "Others" and will be organized in more detail in the future.