List of Temporal Sentence Grounding papers.
The task is also usually referred to as:
- Temporal Sentence Grounding (TSG)
- Video Moment Retrieval (VMR)
- Temporal Activity Localization via Language Query (TALL)
- 1 Survey
- 2 Datasets
- 3 Paper
- 2017: Fully Supervised
- 2018: Fully Supervised, Weakly Supervised
- 2019: Fully Supervised, Weakly Supervised
- 2020: Fully Supervised, Weakly Supervised
- 2021: Fully Supervised, Weakly Supervised, Zero-Shot
- 2022: Fully Supervised, Weakly Supervised, Point-supervised/Glance
- 2023: Fully Supervised, Weakly Supervised, Point-supervised/Glance, Zero-Shot
- 2024: Fully Supervised, Weakly Supervised
- [TPAMI'23] Temporal Sentence Grounding in Videos: A Survey and Future Directions. NTU, Aixin Sun's group
- [ACM Comput. Surv.'23] A Survey on Video Moment Localization. HIT, Liqiang Nie's group
- Charades-STA: VGG, C3D, I3D, CLIP+SF
- TACoS: C3D, I3D
- ActivityNet Captions: C3D
- QVHighlights: CLIP+SF
First to propose the TSG task.
Proposal-based
- [ICCV'17] TALL: Temporal Activity Localization via Language Query. USC, Jiyang Gao [code]
- [ICCV'17] Localizing Moments in Video with Natural Language. UC Berkeley, Lisa Anne Hendricks [code]
Proposal-based
- [EMNLP'18] Temporally Grounding Natural Sentence in Video. NUS, Tat-Seng Chua's group
- [IJCAI'18] Multi-modal Circulant Fusion for Video-to-Language and Backward. Tianjin University, Yahong Han's group
- [ACM MM'18] Cross-modal Moment Localization in Videos. Shandong University, Liqiang Nie's group [code]
- [SIGIR'18] Attentive Moment Retrieval in Videos. Shandong University, Liqiang Nie's group [code]
Proposal-free
- [AAAI'19] Localizing Natural Language in Videos. Tencent AI Lab
Reconstruction-based
- [NeurIPS'18] Weakly Supervised Dense Event Captioning in Videos. Tsinghua, Wenwu Zhu's group [code]
  - First to propose weakly supervised dense event captioning; its training involves the TSG problem.
Proposal-based
- [AAAI'19] Semantic Proposal for Activity Localization in Videos via Sentence Query. Fudan, Yu-Gang Jiang's group
- [CVPR'19] MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment. UCSB, Da Zhang
- [ACM MM'19] Exploiting Temporal Relationships in Video Moment Localization with Natural Language. UR, Jiebo Luo's group [code]
- [NeurIPS'19] Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos. Tsinghua, Wenwu Zhu's group [code]
- [SIGIR'19] Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos. Zhejiang University, Zhou Zhao's group [code]
- [WACV'19] MAC: Mining Activity Concepts for Language-based Temporal Localization. USC [code]
Proposal-free
- [AAAI'19] Multilevel Language and Vision Integration for Text-to-Clip Retrieval. BU, Huijuan Xu [code]
- [AAAI'19] To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression. Tsinghua, Wenwu Zhu's group [code]
- [EMNLP'19] DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization. Zhejiang University, Jun Xiao's group
RL-based
- [AAAI'19] Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. Baidu
- [CVPR'19] Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. CAS, Liang Wang's group
MIL-based
- [CVPR'19] Weakly Supervised Video Moment Retrieval From Text Queries. UCR, Amit K. Roy-Chowdhury's group [code]
  - Formally introduces the weakly supervised temporal sentence grounding task.
- [EMNLP'19] WSLLN: Weakly Supervised Natural Language Localization Networks. Salesforce
Proposal-based
- [AAAI'20] Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language. UR, Jiebo Luo's group [code]
  - First to propose the 2D temporal map approach; most later proposal-based papers build on it.
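
The 2D map idea noted for the 2D-TAN entry can be illustrated with a rough sketch: cell `(i, j)` of an `N x N` grid stands for the candidate moment that starts at clip `i` and ends at clip `j`, so only cells with `j >= i` are valid. This is only an indexing illustration under simple assumptions (fixed-length clips, no sparse sampling, no learned scores); the function names `build_moment_map`, `temporal_iou`, and `best_cell` are hypothetical and do not come from the paper's code.

```python
from typing import List, Optional, Tuple

Moment = Tuple[float, float]  # (start_seconds, end_seconds)


def build_moment_map(num_clips: int, clip_seconds: float) -> List[List[Optional[Moment]]]:
    """Return a grid where grid[i][j] is the moment covering clips i..j.

    Cells below the diagonal (j < i) would end before they start,
    so they are marked invalid with None.
    """
    grid: List[List[Optional[Moment]]] = []
    for i in range(num_clips):
        row: List[Optional[Moment]] = []
        for j in range(num_clips):
            if j >= i:
                row.append((i * clip_seconds, (j + 1) * clip_seconds))
            else:
                row.append(None)
        grid.append(row)
    return grid


def temporal_iou(a: Moment, b: Moment) -> float:
    """Temporal intersection-over-union of two (start, end) moments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    # The hull length equals the true union whenever the moments overlap;
    # when they do not overlap, inter is 0 and the IoU is 0 anyway.
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return inter / hull if hull > 0 else 0.0


def best_cell(grid: List[List[Optional[Moment]]], target: Moment) -> Tuple[int, int]:
    """Return the (i, j) index of the valid cell with the highest IoU to target."""
    best, best_iou = (0, 0), -1.0
    for i, row in enumerate(grid):
        for j, moment in enumerate(row):
            if moment is None:
                continue
            iou = temporal_iou(moment, target)
            if iou > best_iou:
                best, best_iou = (i, j), iou
    return best
```

For a video cut into 4 clips of 2 seconds each, `best_cell(build_moment_map(4, 2.0), (2.5, 5.5))` selects cell `(1, 2)`, the moment from 2.0 to 6.0 seconds. In a real model the exhaustive IoU search would be replaced by learned per-cell matching scores.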
Proposal-free
- [ACL'20] Span-based Localizing Network for Natural Language Video Localization. NTU, Aixin Sun's group [code]
Reconstruction-based
- [AAAI'20] Weakly-Supervised Video Moment Retrieval via Semantic Completion Network. Zhejiang University, Zhou Zhao's group [code]
  - First to use masked reconstruction for the weakly supervised TSG (WTSG) task.
Proposal-based
- [SIGIR'21] Deconfounded Video Moment Retrieval with Causal Intervention. NUS, Tat-Seng Chua's group [code]
  - Introduces causal inference to TSG to remove the bias caused by moment location information in videos.
- [CVPR'21] Interventional Video Grounding with Dual Contrastive Learning. BUPT, Guoshun Nan
  - Contrastive learning + causal intervention.
- [CVPR'21] Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval. Hunan University, Da Cao's group
- [ICCV'21] Fast Video Moment Retrieval. CAS, Changsheng Xu's group
Proposal-free
- [TPAMI'21] Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework. NTU, Aixin Sun's group
  - Extended version of VSLNet (ACL'20).
- [TMM'21] Frame-Wise Cross-Modal Matching for Video Moment Retrieval. Qilu University of Technology, Zhiyong Cheng's group [code]
DETR-based
- [NeurIPS'21] QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries. UNC, Jie Lei [code]
  - Jointly addresses moment retrieval (MR) and highlight detection (HD); first to bring DETR to VMR.
First to propose the unsupervised task.
- [ICCV'21] Zero-shot Natural Language Video Localization. Seoul National University, Jonghyun Choi's group [code]
- [TCSVT'21] Learning Video Moment Retrieval Without a Single Annotated Video. CAS, Changsheng Xu's group
Proposal-based
- [SIGIR'22] You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos. SJTU, Xi Zhou's group [code]
- [TCSVT'22] Efficient Video Grounding With Which-Where Reading Comprehension. SJTU, Xi Zhou's group
Proposal-free
- [TIP'22] HiSA: Hierarchically Semantic Associating for Video Temporal Grounding. Xidian University, Cheng Deng's group [code]
DETR-based
- [CVPR'22] UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection. Tencent ARC Lab [code]
Reconstruction-based
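- [AAAI'22] Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining. Peking University, Yang Liu's group [code]
- [CVPR'22] Weakly Supervised Temporal Sentence Grounding with Gaussian-based Contrastive Proposal Learning. Peking University, Yang Liu's group [code]
  - Mines negative samples to better distinguish highly confusable scenes within the same video.
  - Later weakly supervised methods all use CPL as their baseline.
First to propose the single-frame supervision task.
- [TMM'22] Point-Supervised Video Temporal Grounding. Xidian University, Cheng Deng's group
- [SIGIR'22] Video Moment Retrieval from Text Queries via Single Frame Annotation. Fudan, Yu-Gang Jiang's group [code]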
Proposal-based
- [AAAI'23] Phrase-Level Temporal Relationship Mining for Temporal Sentence Localization. Peking University, Yang Liu's group [code]
- [ICCV'23] G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory. Peking University, Yuexian Zou's group
Proposal-free
DETR-based
- [ACL'23] MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction. NTU, Aixin Sun's group [code]
- [CVPR'23] Query-Dependent Video Representation for Moment Retrieval and Highlight Detection. Sungkyunkwan University, Jae-Pil Heo's group [code]
- [ICCV'23] Knowing Where to Focus: Event-aware Transformer for Video Grounding. Yonsei University, Kwanghoon Sohn's group [code]
- [NeurIPS'23] MomentDiff: Generative Video Moment Retrieval from Random to Real. USTC, Hongtao Xie's group [code]
  - Uses diffusion-style denoising to generate the predicted moments.
Bias
- [AAAI'23] Curriculum Multi-Negative Augmentation for Debiased Video Grounding. Tsinghua, Wenwu Zhu's group
Reconstruction-based
- [CVPR'23] Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training. University of Tokyo, Yoichi Sato's group
- [CVPR'23] Iterative Proposal Refinement for Weakly-Supervised Video Grounding. Peking University, Yuexian Zou's group
- [ICCV'23] SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval. KAIST, Chang D. Yoo's group
- [ICCV'23] D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation. Tencent YouTu Lab [code]
- [ACL'23] Generating Structured Pseudo Labels for Noise-resistant Zero-shot Video Sentence Localization. Peking University, Yang Liu's group [code]
Proposal-based
- [ACM MM'24] Maskable Retentive Network for Video Moment Retrieval. Hefei University of Technology, Meng Wang's group [code]
- [AAAI'24] Exploiting Auxiliary Caption for Video Grounding. Peking University, Yuexian Zou's group
Proposal-free
DETR-based
- [AAAI'24] Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval. USTC, Hongtao Xie's group [code]
  - Targets the modality imbalance problem.
- [AAAI'24] TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection. Central China Normal University, Wei Xie's group [code]
- [CVPR'24] Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection. Xi'an Jiaotong University, Ping Wei's group [code]
- [CVPR'24] Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection. Tsinghua, Xiu Li's group [code]
- [ACM MM'24] Prior Knowledge Integration via LLM Encoding and Pseudo Event Regulation for Video Moment Retrieval. HKBU, Xiaoyong Wei's group [code]
Bias
- [AAAI'24] Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video. HIT, Weigang Zhang's group [code]
Reconstruction-based
- [AAAI'24] Gaussian Mixture Proposals with Pull-Push Learning Scheme to Capture Diverse Events for Weakly Supervised Temporal Video Grounding. Seoul National University, Jin Young Choi's group [code]
- [AAAI'24] Omnipotent Distillation with LLMs for Weakly-Supervised Natural Language Video Localization: When Divergence Meets Consistency. NTU, Alex C. Kot's group
- [PR'24] Triadic temporal-semantic alignment for weakly-supervised video moment retrieval. Shandong University, Fengyu Zhou's group
- [ACL'24] Exploiting Intrinsic Multilateral Logical Rules for Weakly Supervised Natural Language Video Localization. Xidian University, Cheng Deng's group