This is the official implementation of the paper "VDMA: Video Question Answering with Dynamically Generated Multi-Agents".
Please refer to the following official repository to download the dataset.
https://github.com/egoschema/EgoSchema
You can download the question file of the EgoSchemaVQA dataset from the following link, which is provided in LLoVi's GitHub repository:
https://drive.google.com/file/d/13M10CB5ePPVlycn754_ff3CwnpPtDfJA/view?usp=drive_link
To use GPT-4o, you need to convert the EgoSchemaVQA videos into lists of images. You can generate them with the following command:
python3 convert_videos_to_images.py
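The script defines the exact sampling and output layout; as a rough illustration, a minimal frame-extraction loop with OpenCV might look like the sketch below. The input/output paths, sampling interval, and file naming here are assumptions, not the script's actual settings.

```python
# Minimal sketch of video-to-image conversion with OpenCV.
# Paths, sampling interval, and file naming are illustrative assumptions;
# see convert_videos_to_images.py for the actual logic.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, every_n_frames: int = 30) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if frame_idx % every_n_frames == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:05d}.jpg"), frame)
            saved += 1
        frame_idx += 1
    cap.release()

extract_frames("videos/example.mp4", "images/example")
```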
To use the Azure GPT-4 Vision model, you also need to create a video index. Note: if you are not using Azure GPT-4 Vision, comment out the relevant code instead.
For detailed information about the Azure GPT-4 Vision Model, please refer to the following link:
https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision?tabs=rest%2Csystem-assigned%2Cresource#use-vision-enhancement-with-video
python3 create_video_index.py
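For orientation, a hedged sketch of the two REST calls the linked documentation describes (create an index, then ingest a video) is shown below. The endpoint and key environment-variable names, the index/ingestion names, the SAS URL, and the preview api-version are all assumptions; create_video_index.py is the authoritative implementation.

```python
# Rough sketch of the Azure AI Vision Video Retrieval REST calls described
# in the Microsoft docs linked above. Env-var names, index/ingestion names,
# the SAS URL, and the api-version are assumptions, not this repo's values.
import os
import requests

ENDPOINT = os.environ["AZURE_COMPUTER_VISION_ENDPOINT"]  # assumed variable name
KEY = os.environ["AZURE_COMPUTER_VISION_KEY"]            # assumed variable name
HEADERS = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}
API_VERSION = "2023-05-01-preview"  # preview version from the docs; may change

# 1) Create a video index.
index_url = f"{ENDPOINT}/computervision/retrieval/indexes/egoschema?api-version={API_VERSION}"
requests.put(index_url, headers=HEADERS,
             json={"features": [{"name": "vision"}]}).raise_for_status()

# 2) Ingest a video into the index; the URL must be reachable by Azure,
#    e.g. an Azure Blob Storage SAS URL.
ingest_url = (f"{ENDPOINT}/computervision/retrieval/indexes/egoschema"
              f"/ingestions/batch-1?api-version={API_VERSION}")
requests.put(ingest_url, headers=HEADERS,
             json={"videos": [{"mode": "add",
                               "documentId": "video_0001",
                               "documentUrl": "https://<account>.blob.core.windows.net/videos/video_0001.mp4?<SAS>"}]}).raise_for_status()
```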
Our model uses the LLoVi caption data, which you can download from the following link:
https://github.com/CeeZh/LLoVi
Then set the correct path to the LLoVi caption data in the following file; its retrieve_video_clip_captions function is what reads the caption data.
retrieve_video_clip_captions.py
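As a sketch of what that function does, assuming the LLoVi release stores captions as a JSON file mapping each video ID to its per-clip captions (the path and exact format below are assumptions; the real ones live in retrieve_video_clip_captions.py):

```python
# Illustrative sketch only: assumes a JSON file mapping video IDs to
# lists of clip captions. See retrieve_video_clip_captions.py for the
# actual path and format.
import json

CAPTION_PATH = "data/llovi_captions.json"  # assumed location; set your own path

def retrieve_video_clip_captions(video_id: str) -> list[str]:
    with open(CAPTION_PATH) as f:
        captions = json.load(f)
    return captions[video_id]
```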
You need to set the environment variables in the following file.
docker/.env
Our model uses Azure OpenAI, Azure Blob Storage, Azure Computer Vision, and OpenAI, so please set the access information for each of these services in the environment variables.
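An illustrative sketch of docker/.env is shown below; the variable names here are assumptions, so check the code for the exact keys it reads.

```
# Illustrative keys only; check the code for the exact names it expects.
OPENAI_API_KEY=<your-openai-key>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_BLOB_STORAGE_CONNECTION_STRING=<your-blob-connection-string>
AZURE_COMPUTER_VISION_ENDPOINT=<your-computer-vision-endpoint>
AZURE_COMPUTER_VISION_KEY=<your-computer-vision-key>
```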
Build the docker image and run the container.
cd VDMA/docker
docker compose build
docker compose up -d
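One way to open a shell in the running container is shown below; the service name vdma is an assumption here, so check docker/docker-compose.yml for the actual name.
docker compose exec vdma bash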
Inside the container, run the following command.
python3 main.py
If you find this code useful, please consider citing our paper.
@article{VDMA,
  title={VDMA: Video Question Answering with Dynamically Generated Multi-Agents},
  author={Noriyuki Kugo and Tatsuya Ishibashi and Kosuke Ono and Yuji Sato},
  journal={ArXiv},
  year={2024},
  volume={abs/2407.03610},
  url={https://github.com/PanasonicConnect/VDMA}
}