
Overview of our Spatial Perception And Reasoning (SPAR) dataset and benchmark. Our dataset is sourced from 4,500 scenes and comprises 33 spatial tasks spanning single-view, multi-view, and video settings. Our benchmark includes over 7,000 carefully curated high-quality samples to comprehensively evaluate the spatial perception and understanding capabilities of existing models.
SPAR-7M is a large-scale vision-language dataset designed to study spatial perception and reasoning in complex 3D scenes. Built upon a novel 2D data generation pipeline, it translates 3D ground-truth from richly annotated scenes into diverse, scalable spatial QA pairs. The dataset spans 33 task types, ranging from basic perception (e.g., depth, distance) to complex reasoning (e.g., spatial imagination, object relation inference), and supports single-view, multi-view, and video-based formats.
Unlike prior datasets, SPAR-7M focuses on spatial diversity and compositionality. It enables systematic evaluation across object-object and object-camera relations, and offers fine-grained control over QA type, view configuration, and cognitive levels.
SPAR-7M covers a wide range of spatial perception and understanding abilities, organized along multiple dimensions:
- Cognitive Level
  - Low-level (Perception): Depth estimation, distance prediction, object location, etc.
  - Medium-level (P-2-R): View change inference, object matching, etc.
  - High-level (Reasoning): Spatial imagination, navigation, multi-view relation inference, etc.
- Spatial Relation Type
  - Object–Object (OO): Inferring spatial relationships between objects.
  - Object–Camera (OC): Estimating object properties relative to the camera (e.g., position, distance, direction).
- Input Modality
  - Single-view: Tasks using one image as input.
  - Multi-view: Tasks requiring reasoning across 3–5 images.
  - Video: Tasks derived from temporally coherent RGB sequences.
Each QA pair is grounded in precise 3D geometry, enabling reliable evaluation and training for spatial tasks.
Each QA sample consists of:
{
  "id": "scene0261_00_16",
  "conversations": [
    {
      "from": "human",
      "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame. Calculate or judge based on the 3D center points of these objects. The unit is meter."
    },
    {
      "from": "gpt",
      "value": "towel's central depth is estimated to be about 1.5 meters."
    }
  ],
  "image": ["scene0261_00/image_color/543.jpg"],
  "type": "depth_prediction_oc",
  "depth": ["scene0261_00/image_depth/543.png"],
  "red_point": [[553, 397]],
  "blue_point": [[641, 838]]
}
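These records can be read with standard JSON tooling. Below is a minimal sketch (not part of the released code) that parses one record and prints its fields; the file path is picked by globbing and is purely illustrative.

```python
import json
from pathlib import Path

# Minimal sketch: read SPAR-7M QA records from a .jsonl file and inspect
# their fields. Any file under spar/<dataset>/qa_jsonl/ works; here we
# simply take the first one found.
qa_file = next(Path("spar/scannet/qa_jsonl/train").rglob("*.jsonl"))

with qa_file.open() as f:
    for line in f:
        sample = json.loads(line)
        question = sample["conversations"][0]["value"]   # human turn
        answer = sample["conversations"][1]["value"]     # gpt turn
        print(sample["type"], sample["image"])
        print("Q:", question)
        print("A:", answer)
        print("points:", sample.get("red_point"), sample.get("blue_point"))
        break  # only show the first record
```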
We also provide metadata for all images, including:
- Camera intrinsics and extrinsics
- Depth maps
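As a minimal sketch of how this metadata can be combined with the pixel annotations above, the snippet below lifts a labeled point into 3D camera coordinates with the standard pinhole model. The millimeter depth scale and the intrinsic file layout are assumptions here, not something this README specifies; adapt them to the actual files in `image_depth/` and `intrinsic/`.

```python
import numpy as np
from PIL import Image

def backproject(depth_path, pixel_uv, K):
    """Lift a pixel to a 3D point in camera coordinates (pinhole model)."""
    # Assumption: the depth PNG stores 16-bit depth in millimeters.
    depth = np.asarray(Image.open(depth_path), dtype=np.float32) / 1000.0
    u, v = pixel_uv
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

# Illustrative usage with the sample above (file names are assumptions):
# K = np.loadtxt("spar/scannet/images/scene0261_00/intrinsic/intrinsic_depth.txt")[:3, :3]
# point_cam = backproject("spar/scannet/images/scene0261_00/image_depth/543.png", (553, 397), K)
```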
We provide two versions of the SPAR-7M dataset:
Version | Description |
---|---|
SPAR-7M | Clean and compact version; includes images, questions, answers, and labels. |
SPAR-7M-RGBD | Full version with additional depth maps, camera intrinsics, and extrinsics. Ideal for 3D-aware training. |
You can download both versions from Hugging Face:
# Download SPAR-7M (default)
huggingface-cli download jasonzhango/SPAR-7M --repo-type dataset
# Download SPAR-7M-RGBD (with depth and camera parameters)
huggingface-cli download jasonzhango/SPAR-7M-RGBD --repo-type dataset
These datasets are split into multiple .tar.gz parts due to Hugging Face file size limits. After downloading all parts, run the following to extract:
# For SPAR-7M
cat spar-*.tar.gz | tar -xvzf -
# For SPAR-7M-RGBD
cat spar-rgbd-*.tar.gz | tar -xvzf -
Alternatively, if Hugging Face is not accessible, you can download through the hf-mirror endpoint using the hfd.sh helper script:
wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com
./hfd.sh jasonzhango/SPAR-7M --dataset
./hfd.sh jasonzhango/SPAR-7M-RGBD --dataset
The dataset directory structure is:
spar/
├── rxr/
├── scannet/
│   ├── images/
│   │   └── scene0000_00/
│   │       ├── image_color/
│   │       ├── video_color/
│   │       ├── image_depth/     # only in SPAR-7M-RGBD
│   │       ├── video_depth/     # only in SPAR-7M-RGBD
│   │       ├── pose/            # only in SPAR-7M-RGBD
│   │       ├── video_pose/      # only in SPAR-7M-RGBD
│   │       ├── intrinsic/       # only in SPAR-7M-RGBD
│   │       └── video_idx.txt
│   └── qa_jsonl/
│       ├── train/
│       │   ├── depth_prediction_oo/
│       │   │   ├── fill/
│       │   │   │   └── fill_76837.jsonl
│       │   │   ├── select/
│       │   │   └── sentence/
│       │   ├── obj_spatial_relation_oc/
│       │   └── spatial_imagination_oo_mv/
│       └── val/
├── scannetpp/
└── structured3d/
Each QA task (e.g., `depth_prediction_oc`, `spatial_relation_oo_mv`, etc.) is organized by task type, with subfolders for different answer formats:
- `fill/` → numerical or descriptive answers
- `select/` → multiple choice
- `sentence/` → natural language answers
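If you want to see which tasks and answer formats a split contains, a few lines of Python over this layout are enough; the root path below is illustrative.

```python
from pathlib import Path

# Walk the qa_jsonl layout described above: one folder per task type,
# each with fill/ select/ sentence/ subfolders.
root = Path("spar/scannet/qa_jsonl/train")
for task_dir in sorted(d for d in root.iterdir() if d.is_dir()):
    for fmt_dir in sorted(d for d in task_dir.iterdir() if d.is_dir()):
        n = len(list(fmt_dir.glob("*.jsonl")))
        print(f"{task_dir.name:<35} {fmt_dir.name:<10} {n} jsonl file(s)")
```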
To train models on SPAR-7M or SPAR-7M-RGBD, we first convert the raw `.jsonl` QA annotations into training index files in the InternVL-style `data_json` format.
We provide a script to automate this:
ln -s path-to-spar-7m ./
python datasets/generate_data_json.py
This script will:
- Recursively scan all `*.jsonl` files under the `spar/` directory
- Convert them into structured `data_json` entries
- Save the output files to the `data_jsons/` folder
By default, the script processes four sub-datasets:
if __name__ == "__main__":
    dataset_list = [
        "rxr",
        "scannet",
        "scannetpp",
        "structured3d",
    ]
    for dataset in dataset_list:
        process_dataset(dataset)
You will find the generated training index files here:
data_jsons/
├── scannet_7799k.json      # Index for all SPAR-7M QA from ScanNet scenes
├── scannetpp_5941k.json    # Index for ScanNet++ scenes
└── ...
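Conceptually, the conversion flattens every per-task `.jsonl` file into one index per sub-dataset. The sketch below illustrates the idea only; the actual `generate_data_json.py` defines the exact InternVL-style entry format (e.g., it may rewrite image paths or add dataset tags).

```python
import json
from pathlib import Path

def build_index(qa_root: Path, out_file: Path) -> None:
    """Collect every QA record under qa_root into a single JSON index.

    Illustrative only: generate_data_json.py is the source of truth for
    the entry schema actually used in training.
    """
    entries = []
    for jsonl_file in sorted(qa_root.rglob("*.jsonl")):
        with jsonl_file.open() as f:
            entries.extend(json.loads(line) for line in f)
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(json.dumps(entries))

# build_index(Path("spar/scannet/qa_jsonl/train"), Path("data_jsons/scannet_example.json"))
```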
Once you've generated the individual `data_json` files, you can use the provided script to mix them with customized ratios, both per dataset and per QA type.
Run the script:
ln -s path-to-spar-7m ./
python datasets/mix_data.py
This script supports two types of mixing control. First, you can control the contribution of each sub-dataset using:
mix_ratios = {
    "rxr_11k": 1.0,
    "scannet_7799k": 1.0,
    "scannetpp_5941k": 1.0,
    "structured3d_2523k": 1.0,
    # "2d_data": 0.5,  # You can also add external 2D datasets
}
You can also balance different answer types to emphasize full-sentence reasoning or suppress multiple-choice overfitting:
qa_type_ratios = {
    "sentence": 1.0,
    "select": 0.1,
    "fill": 0.1,
    "judge": 0.1,
}
The script will output a mixed `data_json` file for training:
data_jsons/
└── 1rxr_1scannet_1scannetpp_1structured3d_7m.json   # The final mixed dataset index
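The exact sampling strategy lives in `mix_data.py`; as a rough illustration, the ratios can be read as per-entry keep probabilities, as in the sketch below. The `answer_format` field and the file names are assumptions made for illustration only.

```python
import json
import random

# Rough illustration of ratio-based mixing (not the repo's mix_data.py):
# treat each ratio as the probability of keeping an entry from that source.
mix_ratios = {"scannet_7799k": 1.0, "scannetpp_5941k": 0.5}   # per-dataset ratios
qa_type_ratios = {"sentence": 1.0, "select": 0.1, "fill": 0.1, "judge": 0.1}

random.seed(0)
mixed = []
for name, ds_ratio in mix_ratios.items():
    with open(f"data_jsons/{name}.json") as f:
        entries = json.load(f)                         # assumes a flat list of entries
    for entry in entries:
        fmt = entry.get("answer_format", "sentence")   # assumed field name
        if random.random() < ds_ratio * qa_type_ratios.get(fmt, 1.0):
            mixed.append(entry)

with open("data_jsons/mixed_example.json", "w") as f:
    json.dump(mixed, f)
```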
We provide a lightweight demo script to help you understand how to load and visualize the QA data in `data_json` format.
python datasets/toy_dataset.py --json_dir data_jsons/toy.json
This script will:
- Load QA samples from a `data_json` file
- Parse image paths and annotation info (e.g., bounding boxes, points, text)
- Call functions from `datasets/draw_marker.py` to render visual markers
Tip: You can modify `toy_dataset.py` to iterate over your full training set or to save the visualization outputs to disk.
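As a standalone illustration of what the marker rendering amounts to (independent of `draw_marker.py`), the snippet below draws the red/blue points of the example record onto its RGB frame; the coordinates come from the sample shown earlier, and the image path assumes the frames resolve under `spar/scannet/images/`.

```python
from PIL import Image, ImageDraw

# Draw the red/blue reference points from a QA record onto its RGB frame.
img = Image.open("spar/scannet/images/scene0261_00/image_color/543.jpg").convert("RGB")
draw = ImageDraw.Draw(img)
for (x, y), color in [((553, 397), "red"), ((641, 838), "blue")]:
    draw.ellipse([x - 8, y - 8, x + 8, y + 8], outline=color, width=4)
img.save("marked_543.jpg")
```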
SPAR-Bench is a high-quality benchmark for evaluating the spatial perception and understanding capability of Vision-Language Models (VLMs), built on top of SPAR-7M with human verification. We initially sampled 8,000 questions from the validation set (400 per task), and after a thorough manual filtering process to remove ambiguous or problematic cases, the final benchmark contains 7,207 high-quality QA pairs.
SPAR-Bench contains 20 representative spatial tasks:
Level | Description | Tasks |
---|---|---|
Low-level Spatial Perception | Basic perception of spatial properties such as depth and distance | `Depth_OC`, `Depth_OC_MV`, `Depth_OO`, `Depth_OO_MV`, `Dist_OC`, `Dist_OC_MV`, `Dist_OO`, `Dist_OO_MV` |
Medium-level Cross-view Perception | View-based alignment, motion inference, and position matching | `PosMatch`, `CamMotion`, `ViewChg` |
High-level Spatial Reasoning | Multi-object reasoning and imagination across multiple views | `DistI_OO`, `DistI_OO_MV`, `ObjRel_OC_MV`, `ObjRel_OO`, `ObjRel_OO_MV`, `SpImag_OC`, `SpImag_OC_MV`, `SpImag_OO`, `SpImag_OO_MV` |
We evaluate a wide range of models on SPAR-Bench, including commercial APIs and open-source vision-language models.
To ensure fair comparison, we exclude models that are fine-tuned on SPAR-7M (e.g., InternVL2.5-8B + SPAR-mix).
Under this setting:
- Human level: 67.27
- Best API model: Qwen2.5-VL-72B (39.40)
- Best open-source model (<8B): InternVL2.5-8B (36.28)
Method | Avg. | Low | Medium | High |
---|---|---|---|---|
**Baselines (eval on tiny)** | | | | |
Random | 32.74 | 31.19 | 38.25 | 32.29 |
Human | 67.27 | 55.31 | 72.32 | 76.22 |
**API models (eval on tiny)** | | | | |
GPT-4o | 36.39 | 29.25 | 24.93 | 45.11 |
Claude-3.7-Sonnet | 21.77 | 25.43 | 7.33 | 23.33 |
Qwen2-VL-72B | 35.62 | 35.28 | 23.39 | 40.00 |
Qwen2.5-VL-72B | 39.40 | 35.35 | 23.05 | 48.44 |
**Open-source models (<8B) (eval on full)** | | | | |
InternVL2-8B | 33.02 | 26.83 | 36.49 | 37.47 |
InternVL2.5-8B | 36.28 | 29.46 | 31.88 | 43.80 |
LLaVA-OV-7B | 31.20 | 21.79 | 26.13 | 40.14 |
Qwen2-VL-7B | 30.74 | 27.52 | 20.44 | 37.03 |
Qwen2.5-VL-7B | 33.07 | 28.75 | 22.97 | 40.27 |
LLaVA-v1.5-7B | 23.65 | 10.85 | 27.50 | 34.09 |
LLaVA-v1.6-7B | 13.21 | 8.53 | 4.79 | 20.18 |
**Fine-tuned (SPAR-mix)** | | | | |
InternVL2.5-8B + SPAR-mix | 63.25 | 65.53 | 63.01 | 60.19 |
We typically exclude fine-tuned models (like InternVL2.5-8B + SPAR-mix) from direct comparison, as they are trained on SPAR-7M and thus not evaluated in a zero-shot setting.
Note:
- `Avg.` is the mean accuracy across all 20 tasks in SPAR-Bench.
- `Low`, `Medium`, and `High` are means over their respective subsets of tasks (which differ in count).
- Therefore, `Avg.` ≠ average of Low/Medium/High.
- Only a subset of models and results is shown here; see our paper for full per-task breakdowns.
We provide tools and instructions to evaluate your own models on SPAR-Bench using lmms-eval.
There are four versions of SPAR-Bench available on Hugging Face:
Version | Description |
---|---|
SPAR-Bench | Full benchmark with 7,207 QA samples across 20 tasks |
SPAR-Bench-Tiny | 50 QA pairs per task (1,000 total); suitable for API or human evaluation |
SPAR-Bench-RGBD | Full version with depth & camera pose info (for 3D-aware models) |
SPAR-Bench-Tiny-RGBD | Tiny + RGBD version |
You can download with:
huggingface-cli download jasonzhango/SPAR-Bench --repo-type dataset
huggingface-cli download jasonzhango/SPAR-Bench-Tiny --repo-type dataset
# Or use: ./hfd.sh jasonzhango/SPAR-Bench --dataset
Create a clean conda environment and install the necessary dependencies:
conda create -n SPAR python=3.10
conda activate SPAR
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install setuptools==57.5.0
pip install flash-attn --no-build-isolation
git clone https://github.com/xx/spar.git
cd lmms-eval
pip install -e .
You can evaluate your model on SPAR-Bench using the provided scripts:
# For Tiny version (recommended for quick testing)
bash eval_sparbench_tiny.sh
# For Full benchmark
bash eval_sparbench.sh
A sample script (`eval_sparbench.sh`) looks like this; you can modify the arguments to fit your own model path, task name, or logging settings:
export MODEL=internvl2
export TASK=sparbench
export SUFFIX=internvl2_full
export PRETRAIN=path-to-model/InternVL2-4B
accelerate launch --main_process_port 51123 --num_processes 8 -m lmms_eval \
--model ${MODEL} --tasks ${TASK} --batch_size 1 --log_samples --log_samples_suffix ${SUFFIX} \
--output_path ./logs/ \
--model_args pretrained=${PRETRAIN},dtype=bf16,attn_implementation=flash-attn
If you cannot access Hugging Face, you can still run SPAR-Bench by modifying the dataset path manually.
Open the benchmark config file:
vim lmms-eval/lmms_eval/tasks/sparbench/sparbench.yaml
Replace the default dataset path:
dataset_path: jasonzhango/SPAR-Bench
With your local dataset path, e.g.:
dataset_path: /cache/your_path/SPAR-Bench
While SPAR-Bench has undergone extensive manual filtering to ensure question quality and clarity, it may still contain occasional ambiguities, edge cases, or annotation issues. We welcome feedback from the community: if you spot any mistakes or unclear samples, feel free to open an issue or pull request.
Additionally, if you find any part of the codebase unclear or hard to use, please let us know. We are committed to continuously improving both the benchmark and its usability.
We would like to thank the following projects for their contributions and inspiration:
- lmms-eval: used as the evaluation framework for SPAR-Bench.
- thinking-in-space: inspired the design of spatial tasks and benchmark formulation.
Their work laid important groundwork for evaluating spatial reasoning in vision-language models.
If you find this project or dataset helpful, please consider citing our paper:
@article{zhang2025from,
title={From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D},
author={Zhang, Jiahui and Chen, Yurui and Zhou, Yanpeng and Xu, Yueming and Huang, Ze and Mei, Jilin and Chen, Junhui and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
year={2025},
journal={arXiv preprint arXiv:2503.22976},
}