Question about Performance without Concatenating Original 2D Features #14

Open
haolinyang-hlyang opened this issue Jan 3, 2025 · 1 comment

Comments

@haolinyang-hlyang

I would like to thank you for your excellent work in this paper. I’ve been following your approach with great interest, and I have a question regarding the feature concatenation strategy mentioned in Section 4.6.

You explain that the original 2D features are concatenated with the fine-tuned features to preserve the generalization ability of the original 2D feature extractor while incorporating the 3D awareness of the fine-tuned features.

I’m curious about the impact on performance if, instead of concatenating the original 2D features, one were to use only the fine-tuned features directly (without any assembly strategy). Specifically, I would like to know how this would affect the performance on both within-domain and out-of-domain evaluation.
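To make the comparison concrete, here is a minimal sketch of the two variants in question. It is only an illustration, not the repository's code: the function names, the toy feature values, and the per-patch list layout are all assumptions, and plain Python lists are used to keep it dependency-free. With real tensors, the same assembly would typically be a concatenation along the channel dimension (for example `torch.cat` with the appropriate `dim`).

```python
from typing import List

# Hypothetical layout: one feature vector per image patch.
Features = List[List[float]]


def concat_features(original: Features, finetuned: Features) -> Features:
    """Assembly strategy: append each fine-tuned vector to the original one."""
    if len(original) != len(finetuned):
        raise ValueError("both feature maps must cover the same patches")
    return [o + f for o, f in zip(original, finetuned)]


def finetuned_only(finetuned: Features) -> Features:
    """Variant asked about here: use the fine-tuned features with no assembly."""
    return [list(f) for f in finetuned]


# Toy example: 2 patches, 3-dimensional features (values are made up).
original = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
finetuned = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]

concat = concat_features(original, finetuned)
only = finetuned_only(finetuned)

# Concatenation doubles the channel count; the fine-tuned-only variant keeps it.
print(len(concat[0]), len(only[0]))
```

The question is then how a downstream probe trained on `concat` compares with one trained on `only`, both within-domain and out-of-domain.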

Any insights would be greatly appreciated!

@ywyue
Owner

ywyue commented Jan 17, 2025

Hi @haolinyang-hlyang, thank you for your interest! First of all, I am sorry for the late reply; the holidays and other commitments got in the way.

This is an interesting question. I haven't evaluated this aspect extensively, so I don't have a formal answer yet. However, a recent paper, "DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes" (https://arxiv.org/abs/2411.11921), uses our DINOv2 fine-tuned on ScanNet++ directly (without concatenation) and found that it performs better than the original DINOv2, even in the outdoor driving domain. The relevant description is in their Section 5.4 (Ablation Studies) and under "Feature Extractor" in Section 7 (Implementation Details). Note that in that paper the features are used for motion mask extraction, which is different from the tasks we originally considered (semantic segmentation and depth estimation).

I hope this offers some insight. To answer your question formally, I need to find some time to run this evaluation on semantic segmentation and depth estimation, and I will post an update afterwards.
