Type-to-Track: Retrieve Any Object via Prompt-based Tracking (NeurIPS 2023)


Pha Nguyen, Kha Gia Quach, Kris Kitani, Khoa Luu

In NeurIPS, 2023.

Project page: uark-cviu.github.io/Type-to-Track

teaser_mot.mp4

The responsive Type-to-Track: the user provides a video sequence and a prompting request. During tracking, the system discriminates appearance attributes to track the target subjects accordingly and iteratively responds to the user's tracking request. Each box color represents a unique identity.

Abstract

One of the recent trends in vision problems is to use natural language captions to describe the objects of interest. This approach can overcome some limitations of traditional methods that rely on bounding boxes or category annotations. This paper introduces a novel paradigm for Multiple Object Tracking called Type-to-Track, which allows users to track objects in videos by typing natural language descriptions. We present a new dataset for this Grounded Multiple Object Tracking task, called GroOT, that contains videos with various types of objects and their corresponding textual captions describing their appearance and action in detail. Additionally, we introduce two new evaluation protocols and formulate evaluation metrics specifically for this task. We develop a new efficient method, a transformer-based eMbed-ENcoDE-extRact framework (MENDER), built on third-order tensor decomposition. Experiments in five scenarios show that our MENDER approach outperforms an alternative two-stage design in both accuracy and efficiency, with up to 14.7% higher accuracy and up to 4× faster speed.
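To give a sense of why a decomposed third-order formulation matters for efficiency, below is a generic, minimal sketch relating tracklet, detection, and prompt-token embeddings through a trilinear correlation, together with an equivalent factorized evaluation that never materializes the full tensor. All dimensions, variable names, and the pooling choice are illustrative assumptions; this is not MENDER's actual formulation.

```python
import torch

# Hypothetical sizes: M tracklets, N detections, K prompt tokens, feature dim d.
M, N, K, d = 4, 6, 5, 32
trk = torch.randn(M, d)   # tracklet embeddings (illustrative)
det = torch.randn(N, d)   # detection embeddings (illustrative)
txt = torch.randn(K, d)   # prompt-token embeddings (illustrative)

# Naive third-order correlation tensor:
# A[m, n, k] = sum_c trk[m, c] * det[n, c] * txt[k, c].
# Materializing A costs O(M * N * K) memory and compute.
A_full = torch.einsum('mc,nc,kc->mnk', trk, det, txt)

# Factorized evaluation: reduce over the prompt dimension first, so the full
# M x N x K tensor is never stored; only an M x N score matrix is produced.
txt_pooled = txt.sum(dim=0)                  # (d,)
scores = (trk * txt_pooled) @ det.T          # (M, N)

# Both routes give the same prompt-aggregated tracklet-detection scores.
assert torch.allclose(scores, A_full.sum(dim=2), atol=1e-4)
```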

Introduction

GroOT contains videos with various types of objects and their corresponding textual captions, totaling 256K words, describing their appearance and actions in detail. To cover a diverse range of scenes, GroOT was created using the official videos and bounding box annotations from MOT17, TAO, and MOT20.

Here are examples of the annotations on videos in the GroOT dataset:

teaser_data.mp4

Annotations

v1.0:

Notes:

  • The test annotations for MOT17 tracklet captions are a sub-optimal ground truth: they are the raw tracking output of the best-performing tracker at the time we constructed the annotations (i.e., BoT-SORT, at 80.5% MOTA and 80.2% IDF1).
  • The "captions" field lists the appearance caption first and the action caption second. Any missing caption is filled with a None value (see the loading sketch after this list).
  • The physical characteristics of a person and their personal accessories, such as clothing, bag color, and hair color, are considered part of their appearance. Therefore, the appearance captions include verbs such as "carrying" or "holding" to describe personal accessories.
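
A minimal sketch of reading the "captions" field, assuming the annotations are stored as COCO-style JSON with an "annotations" list; the file name and all field names other than "captions" are hypothetical placeholders, not a documented schema.

```python
import json

# Hypothetical file name for one GroOT annotation split.
with open("groot_mot17_train.json") as f:
    data = json.load(f)

for ann in data["annotations"][:5]:
    captions = ann.get("captions") or []
    # Per the notes above: index 0 is the appearance caption, index 1 is the
    # action caption; missing captions are stored as None.
    appearance = captions[0] if len(captions) > 0 else None
    action = captions[1] if len(captions) > 1 else None
    print(ann.get("track_id"), appearance, action)
```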

Licensing:

The annotations of GroOT, as well as the original source videos of MOT17 and TAO, are released under a CC BY-NC-SA 3.0 license per their creators. See motchallenge.net for details.

BibTeX

@article{nguyen2023type,
    title        = {Type-to-Track: Retrieve Any Object via Prompt-based Tracking},
    author       = {Nguyen, Pha and Quach, Kha Gia and Kitani, Kris and Luu, Khoa},  
    journal      = {Advances in Neural Information Processing Systems},
    year         = 2023
}

This page was built using the Academic Project Page Template.
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
