
This repository provides the official implementation of "DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition"


DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition

Hayat Ullah, Muhammad Ali Shafique, Abbas Khan, Arslan Munir


Abstract: The landscape of video recognition has undergone a significant transformation, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures in order to achieve better accuracy. While CNNs, especially 3D variants, have excelled in capturing spatio-temporal dynamics for action recognition, recent developments in Transformer models, with their self-attention mechanisms, have proven highly effective in modeling long-range dependencies across space and time. Despite their state-of-the-art performance on prominent video recognition benchmarks, the computational demands of Transformers, particularly in processing dense video data, remain a significant hurdle. To address these challenges, we introduce a lightweight Video Focal Modulation Network named DVFL-Net, which distills the spatio-temporal knowledge from a large pre-trained teacher to a nano student model, making it well-suited for on-device applications. By leveraging knowledge distillation and spatio-temporal feature extraction, our model significantly reduces computational overhead (approximately 7×) while maintaining high performance in video recognition tasks. We combine the forward Kullback–Leibler (KL) divergence and spatio-temporal focal modulation to distill the local and global spatio-temporal context from the Video-FocalNet Base (teacher) to our proposed nano VFL-Net (student) model. We extensively evaluate our DVFL-Net, both with and without forward KL divergence, against recent state-of-the-art human action recognition (HAR) approaches on the UCF50, UCF101, and HMDB51 datasets. Further, we conduct a detailed ablation study in the forward KL divergence setting and report the resulting observations. The results confirm the superiority of the distilled VFL-Net (i.e., DVFL-Net) over existing methods, highlighting its optimal tradeoff between performance and computational efficiency, including reduced memory usage and lower GFLOPs, making it a highly efficient solution for HAR tasks.

Environment Setup

Please follow INSTALL.md for installation.

Dataset Preparation

Please follow DATA.md for data preparation.

Model Zoo

HMDB51

| Model | Depth | Dim | Kernels | Top-1 (%) | Top-5 (%) | Parameters | Download |
|-------|-------|-----|---------|-----------|-----------|------------|----------|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 84.2 | 93.6 | 157M | ckpt |
| Video-FocalNet-S | [2,2,18,2] | 96 | [3,5] | 83.4 | 92.4 | 88M | ckpt |
| Video-FocalNet-T | [2,2,6,2] | 96 | [3,5] | 81.5 | 91.1 | 49M | ckpt |
| VFL-Net | [1,1,2,1] | 96 | [3,5] | 71.6 | 88.5 | 22M | ckpt |
| DVFL-Net | [1,1,2,1] | 96 | [3,5] | 82.7 | 92.8 | 22M | ckpt |

UCF50

| Model | Depth | Dim | Kernels | Top-1 (%) | Top-5 (%) | Parameters | Download |
|-------|-------|-----|---------|-----------|-----------|------------|----------|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 91.1 | 98.7 | 157M | ckpt |
| Video-FocalNet-S | [2,2,18,2] | 96 | [3,5] | 90.8 | 97.9 | 88M | ckpt |
| Video-FocalNet-T | [2,2,6,2] | 96 | [3,5] | 90.1 | 97.1 | 49M | ckpt |
| VFL-Net | [1,1,2,1] | 96 | [3,5] | 81.4 | 91.4 | 22M | ckpt |
| DVFL-Net | [1,1,2,1] | 96 | [3,5] | 86.6 | 95.2 | 22M | ckpt |

UCF101

| Model | Depth | Dim | Kernels | Top-1 (%) | Top-5 (%) | Parameters | Download |
|-------|-------|-----|---------|-----------|-----------|------------|----------|
| Video-FocalNet-B | [2,2,18,2] | 128 | [3,5] | 91.9 | 99.4 | 157M | ckpt |
| Video-FocalNet-S | [2,2,18,2] | 96 | [3,5] | 91.2 | 98.9 | 88M | ckpt |
| Video-FocalNet-T | [2,2,6,2] | 96 | [3,5] | 90.7 | 98.4 | 49M | ckpt |
| VFL-Net | [1,1,2,1] | 96 | [3,5] | 82.5 | 93.7 | 22M | ckpt |
| DVFL-Net | [1,1,2,1] | 96 | [3,5] | 88.4 | 96.1 | 22M | ckpt |
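
To sanity-check a downloaded checkpoint against the parameter counts listed above before plugging it into the training or evaluation scripts, a small PyTorch snippet such as the following can be used; the file name and the 'model' state-dict key are assumptions, so inspect the printed keys if a particular checkpoint is organized differently.

```python
import torch

# Load a downloaded checkpoint on the CPU and report its size.
# The 'model' key for the weights is an assumption; fall back to the raw dict.
checkpoint = torch.load("ckpt_epoch_119.pth", map_location="cpu")
print("checkpoint keys:", list(checkpoint.keys()))

state_dict = checkpoint.get("model", checkpoint)
num_params = sum(v.numel() for v in state_dict.values() if torch.is_tensor(v))
print(f"tensors: {len(state_dict)}, parameters: {num_params / 1e6:.1f}M")
```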

Training (Teacher Pretraining)

To train the teacher model on a given dataset, run the following:

torchrun --nproc_per_node <num-of-gpus-to-use> main.py --cfg <config-file> --output <output-directory> --opts DATA.NUM_FRAMES <number of frames>
  • --nproc_per_node: the number of GPUs to use; in our case, we used 3 GPUs.
  • --cfg: the path of the config file (located in the configs directory) containing the model, data, and training configurations.
  • --output: the path of the output directory that will store the training history (including both model weights and training logs).
  • --opts: allows the user to override additional data-related (e.g., DATA.NUM_FRAMES) or training-related (e.g., TRAIN.EPOCHS) options.

For instance, to train the teacher model (pretraining) using 3 GPUs on the UCF101 dataset, run the following command:

torchrun --nproc_per_node 3 main.py --cfg configs/ucf101/video-focalnet_base.yaml --output output/ --opts DATA.NUM_FRAMES 8

Note: During pretraining, we initialized our teacher model with the ImageNet-1K weights of the FocalNet model. To initialize a model with pretrained weights, set TRAIN.PRETRAINED_PATH to the path of the pretrained model weights. This can be configured either in the configuration file (located in the configs directory) or directly in the bash script. Alternatively, to train the model from scratch, simply leave TRAIN.PRETRAINED_PATH empty.
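
For example, assuming TRAIN.PRETRAINED_PATH can be passed through --opts like the other options, the pretraining command with an explicit ImageNet-1K initialization would look like the following (the weights path is a placeholder):

torchrun --nproc_per_node 3 main.py --cfg configs/ucf101/video-focalnet_base.yaml --output output/ --opts DATA.NUM_FRAMES 8 TRAIN.PRETRAINED_PATH <path-to-focalnet-imagenet1k-weights>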

Alternatively, the above step can be performed by running the corresponding shell script located in the scripts directory. For instance:

bash scripts/ucf101/video-focalnet_base.sh

Training (Knowledge Distillation)

To train the student model in the knowledge distillation setting (i.e., under the supervision of a pretrained teacher model) on a given dataset, run the following:

torchrun --nproc_per_node <num-of-gpus-to-use> main_KD.py --cfg <config-file> --output <output-directory> --opts DATA.NUM_FRAMES <number of frames> KD.ALPHA <alpha value> KD.BETA <beta value> KD.GAMMA <gamma value> KD.TEMPERATURE <temperature value>
  • --nproc_per_node: the number of GPUs to use; in our case, we used 3 GPUs.
  • --cfg: the path of the config file (located in the configs directory) containing the model, data, and training configurations.
  • --output: the path of the output directory that will store the training history (including both model weights and training logs).
  • --opts: allows the user to override additional data-related (e.g., DATA.NUM_FRAMES), training-related (e.g., TRAIN.EPOCHS), and knowledge-distillation-related (e.g., KD.ALPHA, KD.BETA, KD.GAMMA, and KD.TEMPERATURE) options.

For instance, to train the student model (VFL-Net) in the knowledge distillation setting under the supervision of the teacher model (video-focalnet_base) using 3 GPUs on the UCF101 dataset, run the following command:

torchrun --nproc_per_node 3 main_KD.py --cfg configs/ucf101/VFL-Net_kd.yaml --output output/ --opts DATA.NUM_FRAMES 8 KD.ALPHA 0.2 KD.BETA 0.3 KD.GAMMA 0.5 KD.TEMPERATURE 10

Alternatively, the above step can be performed by running the corresponding shell script located in the scripts directory. For instance:

bash scripts/ucf101/VFL-Net_kd.sh
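
The four KD.* options above weight and temper the individual loss terms. The snippet below is a minimal, self-contained PyTorch sketch of one plausible way such a weighted objective is assembled; the mapping of ALPHA, BETA, and GAMMA to specific terms is an assumption for illustration, and the authoritative definition is in main_KD.py.

```python
import torch
import torch.nn.functional as F

def kd_objective(student_logits, teacher_logits, labels, context_loss,
                 alpha=0.2, beta=0.3, gamma=0.5, temperature=10.0):
    """Illustrative weighted distillation objective (assumed term mapping).

    alpha  - hard-label cross-entropy on the student predictions
    beta   - temperature-scaled forward KL(teacher || student) on softened logits
    gamma  - an additional distillation term (e.g., a spatio-temporal
             context/feature loss), passed in pre-computed as context_loss
    """
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),  # student log-probs
        F.softmax(teacher_logits / temperature, dim=-1),      # teacher probs
        reduction="batchmean",
    ) * temperature ** 2  # T^2 keeps gradients comparable to the CE term
    return alpha * ce + beta * kl + gamma * context_loss

# Toy usage with random tensors (batch of 4 clips, 101 classes):
student = torch.randn(4, 101)
teacher = torch.randn(4, 101)
labels = torch.randint(0, 101, (4,))
print(kd_objective(student, teacher, labels, context_loss=torch.tensor(0.0)))
```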

Evaluation

To evaluate the performance of a pretrained model on the test set of a given dataset, run the following:

torchrun --nproc_per_node <num-of-gpus-to-use> main.py --eval --cfg <config-file> --resume <checkpoint> --opts TEST.NUM_CLIP <number of clips> TEST.NUM_CROP <number of crops>
  • --nproc_per_node: the number of GPUs to use; in our case, we used 3 GPUs.
  • --eval: runs the model in evaluation mode and skips training.
  • --cfg: the path of the config file (located in the configs directory) containing the model, data, and training configurations.
  • --resume: the path of the pretrained model weights (checkpoint) to evaluate.
  • --opts: allows the user to override additional data-related (e.g., DATA.NUM_FRAMES) or test-related (e.g., TEST.NUM_CLIP and TEST.NUM_CROP) options.

For instance, to evaluate the student model trained in the knowledge distillation setting using 1 GPU on the test set of the UCF101 dataset, run the following command:

torchrun --nproc_per_node 1 main.py --eval --cfg configs/ucf101/VFL-Net.yaml --resume output/ucf101/VFL-Net/LR_0.1_EP_120_OPT_sgd_FRAMES_8/ckpt_epoch_119.pth --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3
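
With TEST.NUM_CLIP 4 and TEST.NUM_CROP 3, each video is scored from 4 × 3 = 12 spatio-temporal views. The snippet below sketches one common way such multi-view predictions are reduced to a single video-level prediction (averaging softmax scores across views); it illustrates the protocol rather than reproducing the repository's evaluation code.

```python
import torch

# Per-view logits for one video: (num_clips * num_crops, num_classes).
num_views, num_classes = 4 * 3, 101
view_logits = torch.randn(num_views, num_classes)  # stand-in for model outputs

# Average class probabilities over the 12 views, then read off top-1 / top-5.
video_probs = view_logits.softmax(dim=-1).mean(dim=0)
top5 = video_probs.topk(5).indices
print("top-1 class:", top5[0].item(), "top-5 classes:", top5.tolist())
```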

Getting FLOPs

To compute the FLOPs of a model from its configuration file, run:

python get_flops.py --cfg <config-file>
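
As an independent cross-check of the script above, FLOPs for any PyTorch video model can also be estimated with fvcore; the toy Conv3d model and the 8-frame 224×224 input below are placeholders, not the repository's actual model or configuration.

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

# Placeholder model; substitute the network built from its config file.
model = torch.nn.Conv3d(3, 64, kernel_size=3, padding=1)
video = torch.randn(1, 3, 8, 224, 224)  # (batch, channels, frames, height, width)

flops = FlopCountAnalysis(model, video)
print(f"GFLOPs: {flops.total() / 1e9:.2f}")
print(f"Params (M): {parameter_count(model)[''] / 1e6:.2f}")
```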

Visualization (Spatio-Temporal Focal Modulation)

[Figure] Spatio-temporal focal modulation maps at the first, second, third, and fourth layers, shown alongside the input video frames, for sample actions: Basketball, Benchpress, Golf Swing, Mixing, Playing Guitar, Playing Tabla, Pullups, and Skiing.

Contact

For any questions or inquiries, please open an issue in this repository or contact us at hullah2024@fau.edu.

Acknowledgements

We acknowledge that our work is inspired by the FocalNets and Video-FocalNets repositories. We sincerely thank the authors for making their code publicly available.
