Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective

Yingying Fan, Yu Wu, Bo Du and Yutian Lin

Code for the NeurIPS 2023 paper "Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective".

Method Overview

Environment

  • Python 3.7+

Install CLIP and LAION-CLAP, then install the remaining dependencies:

pip install -r requirement.txt
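
If the installation succeeded, a short sanity check along the following lines should load both backbones. This is only a sketch based on the public clip and laion_clap packages; the exact checkpoint used in the paper (LAION-CLAP pre-trained on AudioSet) may require pointing load_ckpt at a specific file.

# Sanity check that CLIP and LAION-CLAP import and load.
# Assumption: the standard `clip` and `laion_clap` pip packages are installed.
import torch
import clip
import laion_clap

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/16, the visual backbone used for the provided features.
clip_model, preprocess = clip.load("ViT-B/16", device=device)

# LAION-CLAP audio backbone; load_ckpt() with no arguments downloads a default
# pre-trained checkpoint, which may differ from the AudioSet checkpoint used here.
clap_model = laion_clap.CLAP_Module(enable_fusion=False)
clap_model.load_ckpt()

print("CLIP and LAION-CLAP loaded.")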

Prepare data

  1. ResNet and VGGish features can be downloaded from Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing. We also provide visual features extracted by CLIP and audio features extracted by LAION-CLAP.
  2. Put the downloaded features into data/feats/.
  3. We use CLIP (ViT-B/16) and LAION-CLAP pre-trained on AudioSet; a sketch of loading the pre-extracted features follows this list.
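
As a rough sketch of how the pre-extracted features can be read back, the snippet below assumes one feature array per video stored under data/feats/; the subdirectory names, file names, and shapes are illustrative assumptions, so check the downloaded archives for the actual layout.

# Sketch: load pre-extracted features for one video.
# Assumption: features are saved as .npy arrays shaped [T, D] under data/feats/,
# in hypothetical "clip" and "clap" subdirectories.
import os
import numpy as np

FEAT_DIR = "data/feats"            # directory from step 2
video_id = "example_video_id"      # hypothetical video id

visual_feat = np.load(os.path.join(FEAT_DIR, "clip", video_id + ".npy"))  # CLIP ViT-B/16 frame features
audio_feat = np.load(os.path.join(FEAT_DIR, "clap", video_id + ".npy"))   # LAION-CLAP segment features

print(visual_feat.shape, audio_feat.shape)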

Label Denoising

python main.py --mode label_denoise --language refine_label/denoised_label.npz --refine_label refine_label/final_label.npz
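
The refined labels in refine_label/final_label.npz are what the training commands below consume via --refine_label. A quick way to inspect the archive (the key names inside are not documented here, so the snippet simply lists whatever it finds):

# Inspect the refined labels produced by the label-denoising step.
# Assumption: refine_label/final_label.npz is a standard NumPy .npz archive.
import numpy as np

labels = np.load("refine_label/final_label.npz", allow_pickle=True)
print("keys:", list(labels.keys()))
for k in labels.keys():
    arr = labels[k]
    print(k, getattr(arr, "shape", type(arr)))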

Train the model

  1. ResNet and VGGish features:
python main.py --mode train_model --num_layers 6 --lr 8e-5 --refine_label refine_label/final_label.npz --save_model true --checkpoint LSLD.pt
  2. CLAP and CLIP features (recommended):
python main.py --mode train_model --num_layers 4 --lr 2e-4 --refine_label refine_label/final_label.npz --save_model true --checkpoint LSLD.pt

Test the model

  • The pre-trained model is available at this Link. To evaluate it:
python main.py --mode test_LSLD --checkpoint LSLD.pt
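
If the test run complains about the checkpoint, it can help to first confirm that the downloaded file loads at all. This is only a sketch and assumes LSLD.pt is a torch-serialized file; the structure stored inside depends on how main.py saves it.

# Quick check that the downloaded checkpoint is readable.
# Assumption: LSLD.pt was written with torch.save, as the --save_model flag suggests.
import torch

ckpt = torch.load("LSLD.pt", map_location="cpu")
if isinstance(ckpt, dict):
    print("checkpoint keys:", list(ckpt.keys())[:10])
else:
    print("loaded object type:", type(ckpt))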

Citation

@article{fan2023revisit,
  title={Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective},
  author={Fan, Yingying and Wu, Yu and Du, Bo and Lin, Yutian},
  journal={arXiv preprint arXiv:2306.00595},
  year={2023}
}
