I would like to thank you for your excellent work in this paper. I’ve been following your approach with great interest, and I have a question regarding the feature concatenation strategy mentioned in Section 4.6.
You explain that the original 2D features are concatenated with the fine-tuned features to preserve the generalization ability of the original 2D feature extractor while incorporating the 3D awareness of the fine-tuned features.
I’m curious about the impact on performance if, instead of concatenating the original 2D features, one were to use only the fine-tuned features directly (without any assembly strategy). Specifically, I would like to know how this would affect the performance on both within-domain and out-of-domain evaluation.
Any insights would be greatly appreciated!
Hi @haolinyang-hlyang, thank you for your interest! First of all, apologies for the late reply due to the holidays and other commitments.
This is an interesting question. I haven't conducted extensive evaluations on this aspect, so I don't currently have a formal answer. However, a recent paper, "DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes" (https://arxiv.org/abs/2411.11921), uses our DINOv2 fine-tuned on ScanNet++ directly (without concatenation) and finds that it outperforms the original DINOv2 even in the outdoor driving domain. The relevant descriptions are in their Section 5.4 (Ablation Studies) and the Feature Extractor paragraph of Section 7 (Implementation Details). Note, however, that in that paper the features are used for motion mask extraction, which differs from the tasks (semantic segmentation and depth estimation) we originally considered.
I hope the above offers some insight. To answer your question formally, though, I will need to find time to run this evaluation on semantic segmentation and depth estimation, and I will update here afterwards.
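For concreteness, here is a minimal sketch of the two options being compared: the Section 4.6 concatenation strategy versus using the fine-tuned features alone. This is not the repository's actual code; it assumes PyTorch and two hypothetical backbones (`dinov2_orig` for the frozen original model, `dinov2_ft` for the fine-tuned one), with feature extraction reduced to a single call that returns per-patch features.

```python
# Illustrative sketch only (hypothetical names, not the paper's implementation).
import torch
import torch.nn as nn

@torch.no_grad()
def assemble_features(images: torch.Tensor,
                      dinov2_orig: nn.Module,
                      dinov2_ft: nn.Module,
                      concat: bool = True) -> torch.Tensor:
    """Return per-patch features of shape (B, N, C) or (B, N, 2C)."""
    feats_ft = dinov2_ft(images)        # 3D-aware, fine-tuned features
    if not concat:
        # Variant asked about in this issue: fine-tuned features only.
        return feats_ft
    feats_orig = dinov2_orig(images)    # original, more generalizable 2D features
    # Section 4.6 strategy: channel-wise concatenation, so the downstream
    # probe (semantic segmentation / depth) sees both feature sets.
    return torch.cat([feats_orig, feats_ft], dim=-1)
```

The concatenated variant doubles the feature dimension seen by the downstream head, which is the price paid for keeping the original extractor's generalization alongside the fine-tuned 3D awareness.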