Official implementation of "Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM"

If you like our project, please give us a star ⭐ on GitHub to get the latest updates.

🎨 Project Page

📰 News

  • [2025.01.05] 🔥🔥🔥 We release Holmes-VAU, an upgraded version of Holmes-VAD with improved annotation granularity, quantity, and quality, built on a more powerful foundation MLLM. The HIVAU-70k benchmark is available now. Please stay tuned!
  • [2024.07.01] 🔥🔥🔥 Our inference code is available, and we release our model at [HolmesVAD-7B].
  • [2024.06.12] 👀 Our HolmesVAD and VAD-Instruct50k will be available soon. Star ⭐ this repository for the latest updates.

😮 Highlights

In open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events, and they lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations.

  • First, towards an unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm. Efficient single-frame annotations are applied to the collected untrimmed videos and are then synthesized into high-quality analyses of both abnormal and normal video clips using a robust off-the-shelf video captioner and a large language model (LLM).
  • Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection. We train a lightweight temporal sampler to select frames with high anomaly response and fine-tune a multimodal large language model (MLLM) to generate explanatory content.
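
The idea of selecting frames with high anomaly response can be illustrated with a small density-aware sampler. This is a minimal sketch and not the repository's implementation: it assumes per-frame anomaly scores in [0, 1] are already available, and the function name `sample_frames_by_anomaly` and the `eps` smoothing term are hypothetical choices made for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def sample_frames_by_anomaly(scores, num_samples, eps=1e-6):
    """Pick frame indices so that high-scoring regions are sampled more densely.

    scores: per-frame anomaly scores in [0, 1] (assumed to come from some scorer).
    num_samples: upper bound on the number of frame indices returned.
    """
    if not scores or num_samples <= 0:
        return []
    # eps keeps low-score regions reachable; all-zero scores fall back to
    # roughly uniform sampling.
    weights = [s + eps for s in scores]
    cdf = list(accumulate(weights))
    total = cdf[-1]
    # Evenly spaced targets along the cumulative score axis (bin midpoints).
    targets = [(k + 0.5) * total / num_samples for k in range(num_samples)]
    # Inverse-CDF lookup: first frame whose cumulative weight reaches the target.
    picked = [min(bisect_left(cdf, t), len(scores) - 1) for t in targets]
    # Duplicates can occur when one frame dominates; keep each index once.
    return sorted(set(picked))


if __name__ == "__main__":
    # 25 frames with an anomalous burst in frames 10-14.
    scores = [0.0] * 10 + [1.0] * 5 + [0.0] * 10
    print(sample_frames_by_anomaly(scores, 5))
```

With the toy scores above, all five sampled frames fall inside the anomalous burst, whereas uniform sampling would spend most of the budget on normal frames.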
MY ALT TEXT

🛠️ Requirements and Installation

  • Python >= 3.10
  • PyTorch == 2.0.1
  • CUDA Version >= 11.7
  • transformers >= 4.37.2
  • Install required packages:
# inference only
git clone https://github.com/pipixin321/HolmesVAD.git
cd HolmesVAD
conda create -n holmesvad python=3.10 -y
conda activate holmesvad
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install decord opencv-python pytorchvideo
# additional packages for training
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
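
After installation, the version requirements listed above can be sanity-checked from Python. The script below is an illustrative sketch, not part of the repository: the helper names are hypothetical, it relies only on the standard library, it assumes the PyPI distribution names `torch` and `transformers`, and it does not check the CUDA version.

```python
import re
import sys
from importlib import metadata

# Requirements copied from the list above: package -> (operator, version).
REQUIREMENTS = {"torch": ("==", "2.0.1"), "transformers": (">=", "4.37.2")}


def parse_version(text):
    """Turn '2.0.1+cu117' into (2, 0, 1); local and pre-release suffixes are ignored."""
    public = text.split("+", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", public)[:3])


def satisfies(installed, op, required):
    """Compare two version strings with '==' or '>=' semantics."""
    have, want = parse_version(installed), parse_version(required)
    return have == want if op == "==" else have >= want


def check_environment():
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python {sys.version.split()[0]} found, >= 3.10 required")
    for package, (op, required) in REQUIREMENTS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if not satisfies(installed, op, required):
            problems.append(f"{package} {installed} found, {op} {required} required")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running it inside the `holmesvad` conda environment prints one warning per unmet requirement and nothing when everything matches.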

🤗 Demo

CLI Inference

CUDA_VISIBLE_DEVICES=0 python demo/cli.py --model-path ./checkpoints/HolmesVAD-7B --file ./demo/examples/vad/RoadAccidents133_x264_270_451.mp4
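
To run the same command over several videos, a thin Python wrapper is one option. The sketch below is an assumption-laden convenience, not part of the repository: `build_cli_command` and `run_on_folder` are hypothetical helpers, only the `--model-path` and `--file` flags shown above are used, and it must be launched from the repository root. If `demo/cli.py` prompts for input, each run will wait for the user before moving to the next video.

```python
import os
import subprocess
import sys
from pathlib import Path


def build_cli_command(model_path, video_path):
    """Assemble the demo/cli.py invocation using only the flags shown above."""
    return [
        sys.executable, "demo/cli.py",
        "--model-path", str(model_path),
        "--file", str(video_path),
    ]


def run_on_folder(model_path, video_dir, gpu="0"):
    """Run the CLI demo once per .mp4 file in video_dir, pinned to one GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        # check=True stops the loop on the first failing run.
        subprocess.run(build_cli_command(model_path, video), env=env, check=True)


if __name__ == "__main__":
    run_on_folder("./checkpoints/HolmesVAD-7B", "./demo/examples/vad")
```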

Gradio Web UI

CUDA_VISIBLE_DEVICES=0 python demo/gradio_demo.py

Citation

If you find this repo useful for your research, please consider citing our paper:

@article{zhang2024holmes,
  title={Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM},
  author={Zhang, Huaxin and Xu, Xiaohao and Wang, Xiang and Zuo, Jialong and Han, Chuchu and Huang, Xiaonan and Gao, Changxin and Wang, Yuehuan and Sang, Nong},
  journal={arXiv preprint arXiv:2406.12235},
  year={2024}
}
