
PVCD-Pytorch

This repo includes implementations of neural networks for partial video copy detection.

[1] Temporal Relational Reasoning Network: http://relation.csail.mit.edu/

Model input

The protocol for generating the CNN input:

A video -> total_frames -> num_segments -> d_frames -> k * d_frames, where d_frames denotes the frame tuples (e.g., 2-frames, 3-frames) and k_random is the number of subsampled d_frame tuples.

In particular, a VideoRecord = [image_list_path, total_frames, class_idx]. For example, the VideoRecord [v_CricketShot_g08_c01, 75, 0] means that the video 'v_CricketShot_g08_c01' has 75 frames in total and belongs to class 0. We then select num_segments = 30 evenly spaced frames from the video. From these 30 frames, we generate combinations of d-frames (e.g., 2-frames, 3-frames) such as (0, 1), (0, 2), (0, 3), ... For instance, the total number of 2-frame combinations is 30! / (2! * (30 - 2)!) = 435. To train efficiently, we select only k random subsamples from all combinations. These steps are repeated for the other d-frames.
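The subsampling step above can be sketched in a few lines of Python. The helper name and the fixed seed are assumptions for illustration; only the enumerate-then-subsample logic follows the protocol described here.

```python
import itertools
import random

def sample_d_frame_tuples(num_segments=30, d=2, k=3, seed=0):
    """Enumerate all d-frame index combinations over the sampled segments,
    then keep only k random ones (hypothetical helper, seed fixed for demo)."""
    rng = random.Random(seed)
    all_tuples = list(itertools.combinations(range(num_segments), d))
    # For num_segments=30, d=2 this enumerates C(30, 2) = 435 combinations.
    return rng.sample(all_tuples, k)

pairs = sample_d_frame_tuples()
print(pairs)  # k=3 tuples, each holding d=2 frame indices
```

The same call with d=3 would subsample 3-frame tuples from the C(30, 3) combinations.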

As a result, the input fed to the model is formatted as [batch_size, num_segments, num_channels, img_height, img_width]. The k * d_frames tuples are then generated by the TRN module inside the model.

VCDB - Dataset

[1] https://fvl.fudan.edu.cn/dataset/vcdb/list.htm

The original dataset [1] contains 528 videos across 28 classes. For each video, we used GUI software to cut short segments (e.g., 2-4 seconds) that serve as positive videos for training. However, some videos within the same class contain different visual content, so we removed those inconsistent videos and obtained 454 short videos across 28 classes.

Dataset preparation

We split these 454 videos into three parts for training, validation, and testing. In particular, we used 70% of the data for training and divided the remaining 30% equally between validation and testing (15% each). More precisely, the numbers of videos used for training, validation, and testing are 317, 68, and 69, respectively. The script 'prepare_vcdb_dataset.py' illustrates the pre-processing steps.
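A minimal sketch of this 70/15/15 split, reproducing the 317/68/69 counts; the function name and seed are assumptions, and the repo's prepare_vcdb_dataset.py may differ in details.

```python
import random

def split_dataset(video_ids, train_frac=0.7, seed=42):
    """Shuffle and split into train / validation / test parts
    (hypothetical helper sketching the split described above)."""
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * train_frac)      # 70% for training
    rest = ids[n_train:]                      # remaining 30%
    n_val = len(rest) // 2                    # split the rest in half
    return ids[:n_train], rest[:n_val], rest[n_val:]

train, val, test = split_dataset(range(454))
print(len(train), len(val), len(test))  # 317 68 69
```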

Extract frames

The frames of all videos are extracted with the FFmpeg library, so each video's 'total_frames' equals the number of extracted frame images ('num_frames').

Training

The image features were extracted with a pre-trained ResNet-50 model (2048-D).

We trained the TRN network with the following parameters:

num_segments=30, d_frames=8, k_random=3, num_classes=28, image_size=224,

batch_size=4, num_workers=2, learning_rate=1e-3, momentum=0.9, weight_decay=5e-4, step_size=10, num_epochs=50, dropout=0.6, img_feature_dim=2048
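The step_size=10 parameter implies a step-decay learning-rate schedule. The sketch below assumes a decay factor of 0.1 (the default gamma of PyTorch's StepLR); the repo may use a different value.

```python
def step_lr(base_lr=1e-3, step_size=10, gamma=0.1, num_epochs=50):
    """Learning rate per epoch under step decay: the rate is multiplied
    by gamma every step_size epochs (gamma=0.1 is an assumption)."""
    return [base_lr * gamma ** (epoch // step_size) for epoch in range(num_epochs)]

lrs = step_lr()
print(lrs[0], lrs[10], lrs[49])  # 0.001, then 1e-4 at epoch 10, 1e-7 by epoch 49
```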

Result

The VCDB results (figure)

On the test set, the accuracy reached only 4/69, which suggests the model was overfitting.

SVD - Dataset

[2] https://svdbase.github.io/

To update git after ignoring redundant files:

```shell
git rm -rf --cached .
git add .
git commit -m ".gitignore is now working"
```
