This is the official implementation of the paper "VDMA: Video Question Answering with Dynamically Generated Multi-Agents".
Please refer to the following official repository to download the dataset.
https://github.com/egoschema/EgoSchema
You can download the question file of the EgoSchemaVQA dataset from the following link, which is provided in LLoVi's GitHub repository:
https://drive.google.com/file/d/13M10CB5ePPVlycn754_ff3CwnpPtDfJA/view?usp=drive_link
To use GPT-4o, you need to convert the EgoSchemaVQA videos into lists of images. You can generate them with the following command:
python3 convert_videos_to_images.py
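The script defines the exact sampling and output layout; as a rough illustration, a minimal frame-extraction loop with OpenCV might look like the sketch below. The input/output paths, sampling interval, and file naming here are assumptions, not the script's actual settings.

```python
# Minimal sketch of video-to-image conversion with OpenCV.
# Paths, sampling interval, and file naming are illustrative assumptions;
# see convert_videos_to_images.py for the actual logic.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, every_n_frames: int = 30) -> None:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if frame_idx % every_n_frames == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:05d}.jpg"), frame)
            saved += 1
        frame_idx += 1
    cap.release()

extract_frames("videos/example.mp4", "images/example")
```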
To use the Azure GPT-4 Vision model, you also need to create a video index. Note: if you are not using Azure GPT-4 Vision, comment out the relevant code instead.
For detailed information about the Azure GPT-4 Vision Model, please refer to the following link:
https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/gpt-with-vision?tabs=rest%2Csystem-assigned%2Cresource#use-vision-enhancement-with-video
python3 create_video_index.py
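For orientation, a hedged sketch of the two REST calls the linked documentation describes (create an index, then ingest a video) is shown below. The endpoint and key environment-variable names, the index/ingestion names, the SAS URL, and the preview api-version are all assumptions; create_video_index.py is the authoritative implementation.

```python
# Rough sketch of the Azure AI Vision Video Retrieval REST calls described
# in the Microsoft docs linked above. Env-var names, index/ingestion names,
# the SAS URL, and the api-version are assumptions, not this repo's values.
import os
import requests

ENDPOINT = os.environ["AZURE_COMPUTER_VISION_ENDPOINT"]  # assumed variable name
KEY = os.environ["AZURE_COMPUTER_VISION_KEY"]            # assumed variable name
HEADERS = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}
API_VERSION = "2023-05-01-preview"  # preview version from the docs; may change

# 1) Create a video index.
index_url = f"{ENDPOINT}/computervision/retrieval/indexes/egoschema?api-version={API_VERSION}"
requests.put(index_url, headers=HEADERS,
             json={"features": [{"name": "vision"}]}).raise_for_status()

# 2) Ingest a video into the index; the URL must be reachable by Azure,
#    e.g. an Azure Blob Storage SAS URL.
ingest_url = (f"{ENDPOINT}/computervision/retrieval/indexes/egoschema"
              f"/ingestions/batch-1?api-version={API_VERSION}")
requests.put(ingest_url, headers=HEADERS,
             json={"videos": [{"mode": "add",
                               "documentId": "video_0001",
                               "documentUrl": "https://<account>.blob.core.windows.net/videos/video_0001.mp4?<SAS>"}]}).raise_for_status()
```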
Our model uses the LLoVi caption data, which you can download from the following link:
https://github.com/CeeZh/LLoVi
Then set the correct path to the LLoVi caption data in the following file; its retrieve_video_clip_captions function is what reads the caption data.
retrieve_video_clip_captions.py
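As a sketch of what that function does, assuming the LLoVi release stores captions as a JSON file mapping each video ID to its per-clip captions (the path and exact format below are assumptions; the real ones live in retrieve_video_clip_captions.py):

```python
# Illustrative sketch only: assumes a JSON file mapping video IDs to
# lists of clip captions. See retrieve_video_clip_captions.py for the
# actual path and format.
import json

CAPTION_PATH = "data/llovi_captions.json"  # assumed location; set your own path

def retrieve_video_clip_captions(video_id: str) -> list[str]:
    with open(CAPTION_PATH) as f:
        captions = json.load(f)
    return captions[video_id]
```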
You need to set the environment variables in the following file.
docker/.env
Our model uses Azure OpenAI, Azure Blob Storage, Azure Computer Vision, and OpenAI, so please set the access information for each of these services in the environment variables.
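An illustrative sketch of docker/.env is shown below; the variable names here are assumptions, so check the code for the exact keys it reads.

```
# Illustrative keys only; check the code for the exact names it expects.
OPENAI_API_KEY=<your-openai-key>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_BLOB_STORAGE_CONNECTION_STRING=<your-blob-connection-string>
AZURE_COMPUTER_VISION_ENDPOINT=<your-computer-vision-endpoint>
AZURE_COMPUTER_VISION_KEY=<your-computer-vision-key>
```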
Build the docker image and run the container.
cd VDMA/docker
docker compose build
docker compose up -d
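One way to open a shell in the running container is shown below; the service name vdma is an assumption here, so check docker/docker-compose.yml for the actual name.
docker compose exec vdma bash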
Inside the container, run the following command.
python3 main.py
If you find this code useful, please consider citing our paper.
@article{VDMA,
  title={VDMA: Video Question Answering with Dynamically Generated Multi-Agents},
  author={Noriyuki Kugo and Tatsuya Ishibashi and Kosuke Ono and Yuji Sato},
  journal={ArXiv},
  year={2024},
  volume={abs/2407.03610},
  url={https://github.com/PanasonicConnect/VDMA}
}