Yingying Fan, Yu Wu, Bo Du and Yutian Lin

Code for the NeurIPS 2023 paper "Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective".
- Python 3.7+
- Install CLIP and LAION-CLAP, then install the remaining dependencies:
pip install -r requirement.txt
- ResNet and VGGish features can be downloaded from Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing. We also provide visual features extracted by CLIP and audio features extracted by LAION-CLAP.
- Put the downloaded features into data/feats/.
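The loaders read per-video feature arrays from data/feats/. The exact file names and shapes depend on which extractor you downloaded; a minimal sketch of the assumed layout (hypothetical file name, segment count, and embedding size) with NumPy:

```python
import numpy as np

# Hypothetical example: a per-video visual feature file such as a CLIP
# ViT-B/16 extractor might produce -- one row per one-second segment.
# The real files in data/feats/ may use different names and shapes.
feats = np.random.randn(10, 512).astype(np.float32)  # 10 segments, 512-d
np.save("example_video.npy", feats)

# Training code would load the array back per video id.
loaded = np.load("example_video.npy")
print(loaded.shape)  # (10, 512)
```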
- We use CLIP (ViT-B/16) and LAION-CLAP pre-trained on AudioSet.
- Generate the denoised labels:
python main.py --mode label_denoise --language refine_label/denoised_label.npz --refine_label refine_label/final_label.npz
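The denoising step writes the refined labels to an `.npz` archive that training later reads via `--refine_label`. A sketch of inspecting such an archive with NumPy (the key names and label encoding here are assumptions, not the repo's actual schema):

```python
import numpy as np

# Hypothetical layout: one multi-hot event-label vector per video id.
# The real keys and vector length in final_label.npz may differ.
labels = {"video_0001": np.array([0, 1, 0, 1], dtype=np.int64)}
np.savez("final_label_demo.npz", **labels)

archive = np.load("final_label_demo.npz")
for name in archive.files:
    print(name, archive[name])  # video id and its label vector
```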
- ResNet and VGGish features:
python main.py --mode train_model --num_layers 6 --lr 8e-5 --refine_label refine_label/final_label.npz --save_model true --checkpoint LSLD.pt
- CLAP and CLIP features (recommended):
python main.py --mode train_model --num_layers 4 --lr 2e-4 --refine_label refine_label/final_label.npz --save_model true --checkpoint LSLD.pt
- We provide the pre-trained model at this Link.
python main.py --mode test_LSLD --checkpoint LSLD.pt
@article{fan2023revisit,
  title={Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective},
  author={Fan, Yingying and Wu, Yu and Du, Bo and Lin, Yutian},
  journal={arXiv preprint arXiv:2306.00595},
  year={2023}
}