This repository implements two approaches to video classification (action recognition) on the UCF50 dataset using PyTorch. To pass a video through a model, we randomly sample a fixed number of frames from it. The two neural network models are described below.
UCF50 contains 6676 videos in total, spanning 50 action categories. The videos vary in length (number of frames) and 2D frame size; the shortest has 28 frames.
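Since the shortest clip has 28 frames, sampling up to 28 frames without replacement always succeeds. As a minimal sketch of how such random frame selection could be done with OpenCV (illustrative only; the repository's actual data loader may differ):

```python
import cv2
import numpy as np

def sample_frames(video_path, frames_cnt=28):
    """Randomly pick `frames_cnt` frames from a video, kept in temporal order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Sampling without replacement requires total >= frames_cnt;
    # the shortest UCF50 clip has 28 frames, so frames_cnt <= 28 is always safe.
    idxs = sorted(np.random.choice(total, size=frames_cnt, replace=False))
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)   # seek to frame i
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```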
You may find a Google Colab notebook with some exploratory data analysis at the following link:
Model prediction results and examples can be found here:
Each video is treated as a set of 2D frames randomly sampled along the time axis and passed through a pretrained CNN (for instance, ResNet18 pretrained on ImageNet). For each video, the per-frame logits are averaged into a single logit vector, which classifies the video.
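A minimal sketch of this averaging model, assuming a (batch, frames, channels, height, width) input layout (class and variable names are illustrative, not the repository's actual code):

```python
import torch.nn as nn
from torchvision import models

class CNNAvg(nn.Module):
    """Classify a video by averaging per-frame logits of a pretrained CNN."""

    def __init__(self, n_classes=50):
        super().__init__()
        self.cnn = models.resnet18(pretrained=True)
        # Swap the ImageNet head for a 50-way classifier (UCF50 classes).
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, n_classes)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        logits = self.cnn(x.reshape(b * f, c, h, w))  # (b * f, n_classes)
        return logits.reshape(b, f, -1).mean(dim=1)   # average over frames
```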
The model pairs a CNN encoder with an RNN decoder (see the figure below):
- [encoder] A CNN encodes (i.e., compresses the dimension of) every 2D image x(t) into a 1D vector z(t).
- [decoder] An RNN receives the sequence of vectors z(t) from the CNN encoder and outputs another 1D sequence h(t). A final fully-connected network is attached at the end for categorical predictions; during training, a dropout layer sits between the RNN and the fully-connected layer.
For the CNN encoder, we use existing models pretrained on ImageNet.
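A rough sketch of this encoder-decoder pair, with the dropout layer between the LSTM and the final fully-connected layer as described above (again, names and hyperparameters are illustrative assumptions, not the repository's actual code):

```python
import torch.nn as nn
from torchvision import models

class CNNRNN(nn.Module):
    """CNN encoder + LSTM decoder for video classification."""

    def __init__(self, n_classes=50, hidden_size=256, bidirectional=False):
        super().__init__()
        cnn = models.resnet18(pretrained=True)
        feat_dim = cnn.fc.in_features
        cnn.fc = nn.Identity()           # emit the 1D feature vector z(t)
        self.encoder = cnn
        self.rnn = nn.LSTM(feat_dim, hidden_size, batch_first=True,
                           bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.dropout = nn.Dropout(0.5)   # active only in training mode
        self.fc = nn.Linear(out_dim, n_classes)

    def forward(self, x):
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        z = self.encoder(x.reshape(b * f, c, h, w)).reshape(b, f, -1)  # z(t)
        h_seq, _ = self.rnn(z)           # h(t), shape (b, f, out_dim)
        # Classify from the last hidden state of the sequence.
        return self.fc(self.dropout(h_seq[:, -1]))
```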
- The training set contains 4777 videos.
- The validation set contains 547 videos.
- The test set contains 1891 videos.
- A single epoch of training and validation takes approximately 5-8 minutes on Google Colab Pro with the application's default parameters.
You may install all the required libraries with pip:
pip install -r requirements.txt
You may download the dataset directly from the original website and decompress it into the project's data directory with the following commands (the leading ! runs them in a notebook cell):
!wget --no-check-certificate "https://www.crcv.ucf.edu/data/UCF50.rar"
!unrar e UCF50.rar /content/data/videos/
You may run the training script with its default parameters, without passing any arguments; it will train a model and generate predictions:
python train.py
For more information about the available arguments, refer to the help message:
usage: train.py [-h] [--name NAME] [--data DATA] [--batch-size BATCH_SIZE]
[--frames-cnt FRAMES_CNT] [--model-type MODEL_TYPE] [--bilstm]
[--cnn-model CNN_MODEL] [--epochs EPOCHS]
[--scheduler-patience SCHEDULER_PATIENCE]
[--scheduler-factor SCHEDULER_FACTOR]
[--learning-rate LEARNING_RATE] [--n-workers N_WORKERS]
[--gpu] [--predict] [--continue-training] [--in-memory]
[--stride STRIDE] [--reload-interval RELOAD_INTERVAL]
optional arguments:
-h, --help show this help message and exit
--name NAME, -n NAME Experiment name (for saving best models and prediction
results).
--data DATA, -d DATA Path to dir with videos folder and train/test files.
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size.
--frames-cnt FRAMES_CNT, -f FRAMES_CNT
Number of video frames for random selection.
--model-type MODEL_TYPE
Model to run. Two options: 'cnn-avg' or 'cnn-rnn'.
--bilstm Whether the LSTM is bidirectional.
--cnn-model CNN_MODEL
CNN pretrained Model to use. Two options: 'resnet18'
or 'resnet34'.
--epochs EPOCHS, -e EPOCHS
Number of training epochs.
--scheduler-patience SCHEDULER_PATIENCE
ReduceLROnPlateau scheduler patience.
--scheduler-factor SCHEDULER_FACTOR
ReduceLROnPlateau scheduler factor.
--learning-rate LEARNING_RATE, -lr LEARNING_RATE
Learning rate for the optimizer.
--n-workers N_WORKERS
Number of workers for data loaders.
--gpu Whether to run using GPU or not.
--predict Whether to only make predictions rather than also
training a model.
--continue-training Whether to continue training a stored model or train
a new one.
--in-memory Whether to load the dataset into memory at once or
one by one.
--stride STRIDE Stride value for the data loader.
--reload-interval RELOAD_INTERVAL
Specifies after how many epochs the dataset should be
reloaded; a value of 1 reloads it after each epoch.
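For example, a run that trains the CNN-RNN variant with a bidirectional LSTM on a GPU might look like this (the experiment name and epoch count are illustrative):

python train.py --name cnn-rnn-bilstm --model-type cnn-rnn --cnn-model resnet34 --bilstm --epochs 20 --gpu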
Once a model file has been generated, you can make a prediction for a video using the following command:
python predict.py --model-path [path to a model] --video-path [path to a video]
For more information about the available arguments, refer to the help message:
usage: predict.py [-h] [--model-path MODEL_PATH] [--video-path VIDEO_PATH]
[--frames-cnt FRAMES_CNT] [--model-type MODEL_TYPE]
[--bilstm] [--cnn-model CNN_MODEL] [--gpu]
optional arguments:
-h, --help show this help message and exit
--model-path MODEL_PATH, -m MODEL_PATH
Path to a trained model.
--video-path VIDEO_PATH, -v VIDEO_PATH
Path to a video file.
--frames-cnt FRAMES_CNT, -f FRAMES_CNT
Number of video frames for random selection.
--model-type MODEL_TYPE
Model to run. Two options: 'cnn-avg' or 'cnn-rnn'.
--bilstm Whether the LSTM is bidirectional.
--cnn-model CNN_MODEL
CNN pretrained Model to use. Two options: 'resnet18'
or 'resnet34'.
--gpu Whether to run using GPU or not.
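For example, to classify a single video with a trained CNN-RNN model (the paths are illustrative, and the architecture flags should match those used during training):

python predict.py --model-path models/cnn-rnn-bilstm.pt --video-path data/videos/sample.avi --model-type cnn-rnn --bilstm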