
From Flatland to Space:
Teaching Vision-Language Models to Perceive and Reason in 3D

arXiv | Website | HF Dataset: SPAR-7M | HF Dataset: SPAR-Bench
Jiahui Zhang1*, Yurui Chen1*, Yanpeng Zhou2*, Yueming Xu1, Ze Huang1, Jilin Mei1, Junhui Chen1, Yu-Jie Yuan2, Xinyue Cai2, Guowei Huang2, Xingyue Quan2, Hang Xu2, Li Zhang1
1Fudan University  2Huawei Noah's Ark Lab

Overview of our Spatial Perception And Reasoning (SPAR) dataset and benchmark. Our dataset is sourced from 4,500 scenes and comprises 33 spatial tasks spanning single-view, multi-view, and video settings. Our benchmark includes over 7,000 carefully curated high-quality samples to comprehensively evaluate the spatial perception and understanding capabilities of existing models.


📦 SPAR-7M

📌 Dataset Summary

SPAR-7M is a large-scale vision-language dataset designed to study spatial perception and reasoning in complex 3D scenes. Built upon a novel 2D data generation pipeline, it translates 3D ground-truth from richly annotated scenes into diverse, scalable spatial QA pairs. The dataset spans 33 task types, ranging from basic perception (e.g., depth, distance) to complex reasoning (e.g., spatial imagination, object relation inference), and supports single-view, multi-view, and video-based formats.

Unlike prior datasets, SPAR-7M focuses on spatial diversity and compositionality. It enables systematic evaluation across object-object and object-camera relations, and offers fine-grained control over QA type, view configuration, and cognitive levels.

🪄 Task Types

SPAR-7M covers a wide range of spatial perception and understanding abilities, organized along multiple dimensions:

  • Cognitive Level

    • Low-level (Perception): Depth estimation, distance prediction, object location, etc.
    • Medium-level (P-2-R): View change inference, object matching, etc.
    • High-level (Reasoning): Spatial imagination, navigation, multi-view relation inference, etc.
  • Spatial Relation Type

    • Object–Object (OO): Inferring spatial relationships between objects.
    • Object–Camera (OC): Estimating object properties relative to the camera (e.g., position, distance, direction).
  • Input Modality

    • Single-view: Tasks using one image as input.
    • Multi-view: Tasks requiring reasoning across 3–5 images.
    • Video: Tasks derived from temporally coherent RGB sequences.

Each QA pair is grounded in precise 3D geometry, enabling reliable evaluation and training for spatial tasks.

📄 Data Format and Examples

Each QA sample consists of:

{
    "id": "scene0261_00_16", 
    "conversations": 
        [{
            "from": "human", 
            "value": "With the counter (red point) having a depth of 1.6 meters, determine the depth of towel (blue point) in the same frame.  Calculate or judge based on the 3D center points of these objects. The unit is meter."},
         {
            "from": "gpt", 
            "value": "towel's central depth is estimated to be about 1.5 meters."}], 
    "image": ["scene0261_00/image_color/543.jpg"], 
    "type": "depth_prediction_oc", 
    "depth": ["scene0261_00/image_depth/543.png"],
    "red_point": [[553, 397]], 
    "blue_point": [[641, 838]]
}

We also provide metadata for all images, including:

  • Camera intrinsics and extrinsics
  • Depths
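
Because SPAR-7M-RGBD also ships per-image depth maps and camera parameters, the marked 2D points can be lifted into 3D. Below is a minimal sketch of the idea; the exact file conventions (16-bit depth PNGs in millimeters, a plain-text intrinsic matrix, [x, y] pixel order for the marked points) are assumptions to verify against the released data, not a confirmed part of the format.

import numpy as np
from PIL import Image

def backproject_point(depth_path, intrinsic_path, x, y):
    """Lift a marked pixel (x, y) to a 3D point in camera space (meters).

    Assumptions (not confirmed by the release): the depth PNG stores
    millimeters as uint16, and the intrinsic file is a plain-text 3x3
    (or 4x4) matrix with fx, fy on the diagonal and (cx, cy) in the last column.
    If the color and depth images differ in resolution, (x, y) must be
    rescaled to depth-image coordinates first.
    """
    depth = np.asarray(Image.open(depth_path), dtype=np.float32) / 1000.0  # to meters
    K = np.loadtxt(intrinsic_path)
    K = K[:3, :3] if K.ndim == 2 else K.reshape(3, 3)

    z = depth[int(y), int(x)]            # depth at the marked pixel
    X = (x - K[0, 2]) * z / K[0, 0]      # (x - cx) * z / fx
    Y = (y - K[1, 2]) * z / K[1, 1]      # (y - cy) * z / fy
    return np.array([X, Y, z])

# e.g., for the red point of the sample above (the intrinsic filename is illustrative):
# backproject_point("scene0261_00/image_depth/543.png",
#                   "scene0261_00/intrinsic/intrinsic_depth.txt", 553, 397)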

📥 Download

We provide two versions of the SPAR-7M dataset:

  • SPAR-7M: Clean and compact version; includes images, questions, answers, and labels.
  • SPAR-7M-RGBD: Full version with additional depths, camera intrinsics, and extrinsics. Ideal for 3D-aware training.

You can download both versions from Hugging Face:

# Download SPAR-7M (default)
huggingface-cli download jasonzhango/SPAR-7M --repo-type dataset

# Download SPAR-7M-RGBD (with depth and camera parameters)
huggingface-cli download jasonzhango/SPAR-7M-RGBD --repo-type dataset

These datasets are split into multiple .tar.gz parts due to Hugging Face file size limits. After downloading all parts, run the following to extract:

# For SPAR-7M
cat spar-*.tar.gz | tar -xvzf -

# For SPAR-7M-RGBD
cat spar-rgbd-*.tar.gz | tar -xvzf -

Alternatively, if Hugging Face is not accessible, you can download through the hf-mirror endpoint using the hfd.sh helper script:

wget https://hf-mirror.com/hfd/hfd.sh
chmod a+x hfd.sh
export HF_ENDPOINT=https://hf-mirror.com

./hfd.sh jasonzhango/SPAR-7M --dataset
./hfd.sh jasonzhango/SPAR-7M-RGBD --dataset

The dataset directory structure is:

spar/
├── rxr/
├── scannet/
│   ├── images/
│   │   └── scene0000_00/
│   │       ├── image_color/
│   │       ├── video_color/
│   │       ├── image_depth/           # only in SPAR-7M-RGBD
│   │       ├── video_depth/           # only in SPAR-7M-RGBD
│   │       ├── pose/                  # only in SPAR-7M-RGBD
│   │       ├── video_pose/            # only in SPAR-7M-RGBD
│   │       ├── intrinsic/             # only in SPAR-7M-RGBD
│   │       └── video_idx.txt
│   └── qa_jsonl/
│       ├── train/
│       │   ├── depth_prediction_oo/
│       │   │   ├── fill/
│       │   │   │   └── fill_76837.jsonl
│       │   │   ├── select/
│       │   │   └── sentence/
│       │   ├── obj_spatial_relation_oc/
│       │   └── spatial_imagination_oo_mv/
│       └── val/
├── scannetpp/
└── structured3d/

Each QA task (e.g., depth_prediction_oc, spatial_relation_oo_mv) is organized by task type, with subfolders for different answer formats:

  • fill/: numerical or descriptive answers
  • select/: multiple choice
  • sentence/: natural language answers
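
To get a feel for the task distribution before building training indices, a short script can walk this layout and count QA pairs per task and answer format. This is a minimal sketch assuming the <task>/<format>/*.jsonl layout shown above, with one JSON object per line:

from collections import Counter
from pathlib import Path

def count_qa(root="spar/scannet/qa_jsonl/train"):
    """Count QA pairs per (task, answer format) under the qa_jsonl tree."""
    counts = Counter()
    for path in Path(root).rglob("*.jsonl"):
        task, answer_format = path.parts[-3], path.parts[-2]
        with open(path) as f:
            counts[(task, answer_format)] += sum(1 for _ in f)  # one QA per line
    return counts

if __name__ == "__main__":
    for (task, fmt), n in sorted(count_qa().items()):
        print(f"{task:40s} {fmt:10s} {n}")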

🛠️ Generate Training Index Files

To train models on SPAR-7M or SPAR-7M-RGBD, we first convert raw .jsonl QA annotations into training index files in the InternVL-style data_json format.

We provide a script to automate this:

ln -s path-to-spar-7m ./
python datasets/generate_data_json.py

This script will:

  • Recursively scan all *.jsonl files under the spar/ directory
  • Convert them into structured data_json entries
  • Save the output files to the data_jsons/ folder

By default, the script processes four sub-datasets:

if __name__ == "__main__":
    dataset_list = [
        "rxr",
        "scannet",
        "scannetpp",
        "structured3d",
    ]
    for dataset in dataset_list:
        process_dataset(dataset)

You will find the generated training index files here:

data_jsons/
├── scannet_7799k.json       # Index for all SPAR-7M QA from ScanNet scenes
├── scannetpp_5941k.json     # Index for ScanNet++ scenes
├── ...
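
A quick sanity check on a generated index can catch path or formatting problems early. The sketch below assumes each index file is a JSON list of QA entries (check this against the actual output of generate_data_json.py):

import json

with open("data_jsons/scannet_7799k.json") as f:    # any generated index file
    entries = json.load(f)

print(len(entries), "entries")
print(json.dumps(entries[0], indent=2)[:500])        # peek at the first entry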

🔀 Mix Data for Pretraining

Once you've generated individual data_json files, you can use the provided script to mix them with customized ratios, both per-dataset and per QA type.

Run the script:

ln -s path-to-spar-7m ./
python datasets/mix_data.py

This script supports two types of mixing control:

📊 Dataset Mixing Ratio

You can control the contribution of each sub-dataset using:

mix_ratios = {
    "rxr_11k": 1.0,
    "scannet_7799k": 1.0,
    "scannetpp_5941k": 1.0,
    "structured3d_2523k": 1.0,
    # "2d_data": 0.5,  # You can also add external 2D datasets
}

🎛️ QA Type Ratio

You can also balance different answer types to emphasize full-sentence reasoning or suppress multiple-choice overfitting:

qa_type_ratios = {
    "sentence": 1.0,
    "select": 0.1,
    "fill": 0.1,
    "judge": 0.1,
}

The script will output a mixed data_json file for training:

data_jsons/
├── 1rxr_1scannet_1scannetpp_1structured3d_7m.json  # The final mixed dataset index
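
Conceptually, the mixing step is ratio-based subsampling over the per-dataset indices. The following sketch illustrates that idea under two assumptions that may differ from the actual mix_data.py: each index is a JSON list of entries, and an entry's answer format is recoverable from a field on the entry. Ratios above 1.0 (i.e., repeating samples) are not handled here.

import json
import random

# Hypothetical ratios mirroring the dicts above.
mix_ratios = {"scannet_7799k": 1.0, "scannetpp_5941k": 0.5}
qa_type_ratios = {"sentence": 1.0, "select": 0.1, "fill": 0.1, "judge": 0.1}

def answer_format(entry):
    # Assumption: the answer format is stored on the entry (or derivable from its type).
    return entry.get("format", "sentence")

random.seed(0)
mixed = []
for name, ratio in mix_ratios.items():
    with open(f"data_jsons/{name}.json") as f:
        entries = json.load(f)
    for entry in entries:
        if random.random() < ratio * qa_type_ratios.get(answer_format(entry), 1.0):
            mixed.append(entry)

with open("data_jsons/mixed_example.json", "w") as f:
    json.dump(mixed, f)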

🧪 Visualize QA Samples

We provide a lightweight demo script to help you understand how to load and visualize the QA data in data_json format.

python datasets/toy_dataset.py --json_dir data_jsons/toy.json

This script will:

  • Load QA samples from a data_json file
  • Parse image paths and annotation info (e.g., bounding boxes, points, text)
  • Call functions from datasets/draw_marker.py to render visual markers

💡 Tip: You can modify toy_dataset.py to iterate over your full training set or to save the visualization outputs to disk.
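
For reference, the core of such a visualization is only a few lines. The sketch below draws the red/blue point markers with PIL, assuming entries follow the sample format shown earlier (a single image path plus red_point/blue_point pixel coordinates) and that images live under spar/<dataset>/images/; datasets/draw_marker.py remains the authoritative implementation.

from PIL import Image, ImageDraw

def draw_points(entry, image_root="spar/scannet/images", radius=6):
    """Overlay the red/blue point markers from one QA entry onto its image."""
    img = Image.open(f"{image_root}/{entry['image'][0]}").convert("RGB")
    draw = ImageDraw.Draw(img)
    for key, color in (("red_point", "red"), ("blue_point", "blue")):
        for x, y in entry.get(key, []):
            draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=color)
    return img

# Illustrative usage:
# import json
# entry = json.load(open("data_jsons/toy.json"))[0]
# draw_points(entry).save("vis_0.jpg")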

🎯 SPAR-Bench

SPAR-Bench is a high-quality benchmark for evaluating the spatial perception and understanding capability of Vision-Language Models (VLMs), built on top of SPAR-7M with human verification. We initially sampled 8,000 questions from the validation set (400 per task), and after a thorough manual filtering process to remove ambiguous or problematic cases, the final benchmark contains 7,207 high-quality QA pairs.

🧩 Task List & Cognitive Levels

SPAR-Bench contains 20 representative spatial tasks:

  • 🔹 Low-level Spatial Perception: basic perception of spatial properties such as depth and distance. Tasks: Depth_OC, Depth_OC_MV, Depth_OO, Depth_OO_MV, Dist_OC, Dist_OC_MV, Dist_OO, Dist_OO_MV
  • 🔸 Medium-level Cross-view Perception: view-based alignment, motion inference, and position matching. Tasks: PosMatch, CamMotion, ViewChg
  • 🔺 High-level Spatial Reasoning: multi-object reasoning and imagination across multiple views. Tasks: DistI_OO, DistI_OO_MV, ObjRel_OC_MV, ObjRel_OO, ObjRel_OO_MV, SpImag_OC, SpImag_OC_MV, SpImag_OO, SpImag_OO_MV

Evaluation Results

We evaluate a wide range of models on SPAR-Bench, including commercial APIs and open-source vision-language models.

To ensure fair comparison, we exclude models that are fine-tuned on SPAR-7M (e.g., InternVL2.5-8B + SPAR-mix).
Under this setting:

  • 👨 Human Level: 67.27
  • 🥇 Best API model: Qwen2.5-VL-72B (39.40)
  • 🥇 Best open-source model (<8B): InternVL2.5-8B (36.28)

🔒 Performance Summary (Average Accuracy % by Level)

Method Avg. Low Medium High
🟀 Baselines (eval on tiny)
Random 32.74 31.19 38.25 32.29
Human 67.27 55.31 72.32 76.22
🟦 API Models (eval on tiny)
GPT-4o 36.39 29.25 24.93 45.11
Claude-3.7-Sonnet 21.77 25.43 7.33 23.33
Qwen2-VL-72B 35.62 35.28 23.39 40.00
Qwen2.5-VL-72B 39.40 35.35 23.05 48.44
🟨 Open-source Models (<8B) (eval on full)
InternVL2-8B 33.02 26.83 36.49 37.47
InternVL2.5-8B 36.28 29.46 31.88 43.80
LLaVA-OV-7B 31.20 21.79 26.13 40.14
Qwen2-VL-7B 30.74 27.52 20.44 37.03
Qwen2.5-VL-7B 33.07 28.75 22.97 40.27
LLaVA-v1.5-7B 23.65 10.85 27.50 34.09
LLaVA-v1.6-7B 13.21 8.53 4.79 20.18
🟥 Fine-tuned (SPAR-mix)
InternVL2.5-8B + SPAR-mix 63.25 65.53 63.01 60.19

⚠️ We typically exclude fine-tuned models (like InternVL2.5-8B + SPAR-mix) from direct comparison, as they are trained on SPAR-7M and thus not evaluated in a zero-shot setting.

📌 Note:

  • Avg. is the mean accuracy across all 20 tasks in SPAR-Bench.
  • Low, Medium, and High are means over their respective subsets of tasks (which differ in count).
  • Therefore, Avg. ≠ the average of Low/Medium/High.
  • Only a subset of models and results is shown here; see our paper for full per-task breakdowns.
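
As a small numeric illustration of the last point (the per-task scores below are made up): with 8 low-level, 3 medium-level, and 9 high-level tasks, the mean over all 20 tasks weights the levels unequally, so it differs from the plain average of the three level means.

# Hypothetical per-task accuracies: 8 low-, 3 medium-, 9 high-level tasks.
low, medium, high = [30.0] * 8, [40.0] * 3, [50.0] * 9

all_tasks = low + medium + high
avg_over_tasks = sum(all_tasks) / len(all_tasks)                        # 40.5
avg_over_levels = (sum(low) / 8 + sum(medium) / 3 + sum(high) / 9) / 3  # 40.0

print(avg_over_tasks, avg_over_levels)  # the two averages differ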

🕹️ Run Your Own Evaluation

We provide tools and instructions to evaluate your own models on SPAR-Bench using lmms-eval.

📥 Download Benchmark

There are four versions of SPAR-Bench available on Hugging Face:

  • SPAR-Bench: Full benchmark with 7,207 QA samples across 20 tasks.
  • SPAR-Bench-Tiny: 50 QA per task (1,000 total); suitable for API or human evaluation.
  • SPAR-Bench-RGBD: Full version with depth & camera pose info (for 3D-aware models).
  • SPAR-Bench-Tiny-RGBD: Tiny + RGBD version.

You can download with:

huggingface-cli download jasonzhango/SPAR-Bench --repo-type dataset
huggingface-cli download jasonzhango/SPAR-Bench-Tiny --repo-type dataset
# Or use: ./hfd.sh jasonzhango/SPAR-Bench --dataset

⚙️ Set Up Environment

Create a clean conda environment and install the necessary dependencies:

conda create -n SPAR python=3.10
conda activate SPAR

pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install setuptools==57.5.0 
pip install flash-attn --no-build-isolation

git clone https://github.com/fudan-zvg/spar.git
cd spar/lmms-eval
pip install -e .

🚀 Run Evaluation

You can evaluate your model on SPAR-Bench using the provided scripts:

# For Tiny version (recommended for quick testing)
bash eval_sparbench_tiny.sh

# For Full benchmark
bash eval_sparbench.sh

A sample script (eval_sparbench.sh) looks like this; you can modify these arguments to fit your own model path, task name, or logging settings:

export MODEL=internvl2
export TASK=sparbench
export SUFFIX=internvl2_full
export PRETRAIN=path-to-model/InternVL2-4B

accelerate launch --main_process_port 51123 --num_processes 8 -m lmms_eval \
  --model ${MODEL} --tasks ${TASK} --batch_size 1 --log_samples --log_samples_suffix ${SUFFIX} \
  --output_path ./logs/ \
  --model_args pretrained=${PRETRAIN},dtype=bf16,attn_implementation=flash-attn

If you cannot access Hugging Face, you can still run SPAR-Bench by modifying the dataset path manually.

Open the benchmark config file:

vim lmms-eval/lmms_eval/tasks/sparbench/sparbench.yaml

Replace the default dataset path:

dataset_path: jasonzhango/SPAR-Bench

With your local dataset path, e.g.:

dataset_path: /cache/your_path/SPAR-Bench

⚠️ Limitations

While SPAR-Bench has undergone extensive manual filtering to ensure question quality and clarity, it may still contain occasional ambiguities, edge cases, or annotation issues. We welcome feedback from the community: if you spot any mistakes or unclear samples, feel free to open an issue or pull request.

Additionally, if you find any part of the codebase unclear or hard to use, please let us know. We are committed to continuously improving both the benchmark and its usability.

🙏 Acknowledgement

We would like to thank the following projects for their contributions and inspiration:

  • lmms-eval: used as the evaluation framework for SPAR-Bench.
  • thinking-in-space: inspired the design of spatial tasks and benchmark formulation.

Their work laid important groundwork for evaluating spatial reasoning in vision-language models.

📚 BibTeX

If you find this project or dataset helpful, please consider citing our paper:

@article{zhang2025from,
    title={From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D},
    author={Zhang, Jiahui and Chen, Yurui and Zhou, Yanpeng and Xu, Yueming and Huang, Ze and Mei, Jilin and Chen, Junhui and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
    year={2025},
    journal={arXiv preprint arXiv:2503.22976},
}
