Visual Queries 3D localization

teaser figure

Installation instructions

The code has been tested with Ubuntu > 16.04 and Python 3.7

  1. Clone the repository from here.

    git clone
    cd episodic-memory/VQ3D
  2. Load the git submodules

    git submodule init
    git submodule update
  3. Create a conda environment.

    conda create -n ego4d_vq3d python=3.7
  4. Install the requirements using pip:

    pip install -r requirements.txt
  5. Install COLMAP following these instructions. Don't forget to add the path to colmap to your PATH environment variable.
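
Before moving on, it can be worth confirming that the COLMAP binary is actually reachable through PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name `find_colmap` is hypothetical and is not part of this repository.

```python
import shutil


def find_colmap(executable="colmap"):
    """Return the path of the given executable as resolved through PATH, or None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_colmap()
    if path is None:
        print("colmap not found; add its install directory to PATH")
    else:
        print(f"colmap found at {path}")
```

If the lookup returns None, the camera pose estimation steps that call COLMAP are likely to fail until PATH is updated.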


  1. Download the data using the client. Place the data into the ./data folder.

    python -m ego4d.cli.cli --output_directory ~/episodic-memory/VQ3D/data/ --dataset full_scale --universities unict
    python -m ego4d.cli.cli --output_directory ~/episodic-memory/VQ3D/data/ --dataset annotations
    python -m ego4d.cli.cli --output_directory ~/episodic-memory/VQ3D/data/ --dataset 3d
  2. [UPDATE] Construct the 5 fps clips using the VQ2D tools. Clips downloaded directly from the client are at 30 fps, whereas the VQ tasks use 5 fps.

  3. Generate the <split>_annot.json files following the instructions in the VQ2D folder (step 2 of the “Running experiments” section). Place the files under ./data/.
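
To make the 30 fps to 5 fps relationship from step 2 concrete, here is a small sketch of which source frames survive a uniform subsampling. It only illustrates the frame-index arithmetic; the VQ2D tools may build the clips differently, and the function name `subsample_indices` is hypothetical.

```python
def subsample_indices(num_frames, src_fps=30, dst_fps=5):
    """Indices of the source frames kept when reducing src_fps to dst_fps.

    Assumes src_fps is an integer multiple of dst_fps (30 / 5 = 6 here),
    so every 6th frame is kept, starting from frame 0.
    """
    if src_fps % dst_fps != 0:
        raise ValueError("src_fps must be an integer multiple of dst_fps")
    step = src_fps // dst_fps
    return list(range(0, num_frames, step))


if __name__ == "__main__":
    # One second of 30 fps video keeps 5 frames.
    print(subsample_indices(30))
```

Under this assumption, frame `i` of a 5 fps clip corresponds to frame `6 * i` of the 30 fps clip.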


Camera pose estimation

cd camera_pose_estimation/
  1. Compute the camera intrinsics.
  • Extract all video frames.

    python \
        --input_dir data/v1/full_scale/ \
        --output_dir data/v1/videos_frames/ \
        --clips_json data/v1/3d/all_videos_for_vq3d_v1.json \
        --j 8 # number of parallel processes. Increase for speed.
  • Pre-select frames for COLMAP

    python \
        --input_dir data/v1/videos_frames/ \
        --output_dir data/v1/videos_frames_for_sfm/ \
        --j 8 # number of parallel processes. Increase for speed.
  • Run COLMAP on the pre-selected frames

    python \
        --input_dir data/v1/videos_frames_for_sfm/ \
        --output_dir data/v1/videos_sfm/
  • For videos where SfM fails, we run a greedy version of the reconstruction that selects 100 frames from the middle of the video.

    python \
        --input_dir data/v1/videos_frames/ \
        --sfm_input_dir data/v1/videos_frames_for_sfm_greedy/ \
        --output_dir data/v1/videos_sfm_greedy/
  • Get the intrinsics for each clip.

    python \
        --input_dir data/v1/videos_sfm/ \
        --input_dir_greedy data/v1/videos_sfm_greedy/ \
        --annotation_dir_greedy data/v1/annotations/ \
        --output_filename data/v1/scan_to_intrinsics.json
  • Note: To aid reproducibility, the camera intrinsics are provided here: data/scan_to_intrinsics.json
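
The greedy fallback described above takes 100 frames from the middle of the video. A minimal sketch of such a selection is shown below. It assumes frames are identified by sortable file names and takes a contiguous window centred on the video; the actual script may choose the frames differently, and `select_middle_frames` is a hypothetical name.

```python
def select_middle_frames(frame_paths, num_selected=100):
    """Return a contiguous window of num_selected frames centred on the video.

    If the video has fewer frames than requested, all frames are returned.
    """
    frames = sorted(frame_paths)
    if len(frames) <= num_selected:
        return frames
    start = (len(frames) - num_selected) // 2
    return frames[start:start + num_selected]


if __name__ == "__main__":
    # Hypothetical frame names, for illustration only.
    frames = [f"frame_{i:06d}.jpg" for i in range(1000)]
    selected = select_middle_frames(frames)
    print(len(selected), selected[0], selected[-1])
```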

  2. Compute the camera poses
  • Extract all clip frames.

    python \
        --input_dir data/v1/clips/ \
        --output_dir data/v1/clips_frames/ \
        --clips_json data/v1/3d/all_clips_for_vq3d_v1.json \
        --j 8 # number of parallel processes. Increase for speed.
  • Get the camera poses for all frames of all clips

        --input_dir data/v1/clips_frames/ \
        --query_filename data/v1/annotations/vq3d_val.json \
        --camera_intrinsics_filename data/v1/scan_to_intrinsics.json \
        --scans_keypoints_dir data/v1/3d/scans_keypoints/ \
        --scans_dir data/v1/3d/scans/ \
        --output_dir data/v1/clips_camera_poses/
  • Note-1: Camera pose estimation results on the val set are included for reference here: data/all_clips_camera_poses_val.json

  • Note-2: To aid reproducibility, we also provide the data for all intermediate steps needed to compute the poses for one clip in the val set. You can download that information here

Depth estimation

cd depth_estimation/
  1. Prepare RGB inputs

    python \
        --input_dir data/v1/clips_camera_poses/ \
        --vq2d_results data/vq2d_results/siam_rcnn_residual_kys_val.json \
        --vq2d_annot data/v1/annotations/val_annot.json \
        --vq3d_queries data/v1/annotations/vq3d_val.json \
        --vq2d_queries data/v1/annotations/vq_val.json
  2. Compute Depths

    python \
        --input_dir data/v1/clips_camera_poses/
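
As background on how a depth value and the camera intrinsics can turn a 2D detection into a 3D point, here is a sketch of pinhole back-projection. It assumes a standard pinhole model with focal lengths `fx`, `fy` and principal point `cx`, `cy`, and that `depth` is measured along the optical axis. It is an illustration only and is not taken from the scripts in this folder.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a known depth to a 3D point in camera coordinates.

    Pinhole model: x = (u - cx) / fx * z, y = (v - cy) / fy * z, z = depth.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


if __name__ == "__main__":
    # Hypothetical intrinsics for a 640x480 image.
    print(backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

A pixel at the principal point maps to a point straight ahead of the camera, at distance `depth` along the optical axis.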


cd VQ3D/
  1. Compute the ground-truth vector in the query frame coordinate system for queries with an estimated pose.

    python scripts/ \
        --input_dir data/v1/clips_camera_poses/ \
        --vq3d_queries data/v1/annotations/vq3d_val.json \
        --output_fileaname data/v1/annotations/vq3d_val_wgt.json \
        --vq2d_queries data/v1/vq_val.json
  2. Compute 3D vector predictions

    python scripts/ \
        --input_dir data/v1/clips_camera_poses/ \
        --output_fileaname data/vq3d_results/siam_rcnn_residual_kys_val.json \
        --vq2d_results data/vq2d_results/siam_rcnn_residual_kys_val.json \
        --vq2d_annot data/v1/annotations/val_annot.json \
        --vq2d_queries data/v1/annotations/vq_val.json \
        --vq3d_queries data/v1/annotations/vq3d_val_wgt.json
  3. Run evaluation

    python scripts/ \
        --vq3d_results data/vq3d_results/siam_rcnn_residual_kys_val.json
  • Note: To aid reproducibility, we provide the results for the val set here: data/vq3d_results/siam_rcnn_residual_kys_val.json
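
The steps above express an object position as a vector in the query frame coordinate system and then compare prediction against ground truth. The sketch below shows one way this can be done, assuming the camera pose is given as a camera-to-world rotation `R` (3x3, row-major nested lists) and translation `t`. The pose convention, the function names, and the choice of Euclidean and angular error as metrics are assumptions for illustration; the evaluation script may report different quantities.

```python
import math


def world_to_camera(point_w, rotation_c2w, translation_c2w):
    """Express a world point in camera coordinates: p_c = R^T (p_w - t)."""
    d = [point_w[i] - translation_c2w[i] for i in range(3)]
    # Multiplying by R^T means summing over the rows of R for each column i.
    return [sum(rotation_c2w[k][i] * d[k] for k in range(3)) for i in range(3)]


def l2_error(pred, gt):
    """Euclidean distance between the predicted and ground-truth vectors."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))


def angular_error(pred, gt):
    """Angle in radians between the predicted and ground-truth vectors."""
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    if norm == 0.0:
        raise ValueError("angle is undefined for a zero-length vector")
    cos = sum(p * g for p, g in zip(pred, gt)) / norm
    return math.acos(max(-1.0, min(1.0, cos)))


if __name__ == "__main__":
    # Hypothetical pose: camera rotated 90 degrees about the world z axis.
    R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    t = [1.0, 0.0, 0.0]
    gt = world_to_camera([1.0, 2.0, 0.0], R, t)
    pred = [2.0, 0.0, 1.0]
    print(gt, l2_error(pred, gt), angular_error(pred, gt))
```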

[UPDATES] Notes about the Challenge

  1. Please use the vq3d_test_unannotated_template.json file (under data/) for the challenge. These queries are the same as the ones downloaded using the ego4d-client; we have added the 'annotation_uid' entry to help match VQ3D and VQ2D queries.

  2. To find the corresponding VQ2D queries you should use the 'annotation_uid' entry found in each VQ3D query. We have also added a mapping_vq2d_to_vq3d_queries_annotations_test.json to help find the corresponding queries. Please refer to these lines to understand how to use it.

  3. To create a submission for the challenge, use the vq3d_test_unannotated_template.json file and add the required information directly to it.
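
The matching described in item 2 amounts to a join on 'annotation_uid'. The sketch below shows the idea on flat lists of dictionaries. The real JSON files are likely nested differently, so the flat layout, the `query_id` field, and the function names are all assumptions made for illustration; only the 'annotation_uid' key comes from the notes above.

```python
from collections import defaultdict


def group_by_annotation_uid(vq2d_queries):
    """Group VQ2D queries by their 'annotation_uid' value."""
    groups = defaultdict(list)
    for query in vq2d_queries:
        groups[query["annotation_uid"]].append(query)
    return dict(groups)


def match_queries(vq3d_queries, vq2d_queries):
    """Pair each VQ3D query with the VQ2D queries sharing its annotation_uid."""
    groups = group_by_annotation_uid(vq2d_queries)
    return [(q3, groups.get(q3["annotation_uid"], [])) for q3 in vq3d_queries]


if __name__ == "__main__":
    # Toy data with hypothetical fields, not the real file layout.
    vq2d = [
        {"annotation_uid": "a1", "query_id": "2d-0"},
        {"annotation_uid": "a1", "query_id": "2d-1"},
        {"annotation_uid": "b2", "query_id": "2d-2"},
    ]
    vq3d = [{"annotation_uid": "a1"}, {"annotation_uid": "c3"}]
    for q3, matches in match_queries(vq3d, vq2d):
        print(q3["annotation_uid"], [m["query_id"] for m in matches])
```

A VQ3D query with no VQ2D counterpart simply gets an empty list, which makes missing matches easy to spot before building a submission.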