Implementation of a material estimation model based on audio-visual cues: CLIP and Whisper encode the image and audio inputs respectively, and an LLM aligns them to a fixed-size text embedding space, which is then used for material class prediction. The model description is available in the Report.
To run the training code:
python train.py
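
The pipeline described above (encode each modality, align both to a shared fixed-size embedding space, then classify by similarity to material class embeddings) can be sketched as follows. This is a minimal illustrative sketch, not the project's actual code: `encode_image`, `encode_audio`, the averaging fusion, and the `MATERIALS` list are all assumptions standing in for the real CLIP/Whisper encoders and LLM alignment step.

```python
import math
import random

random.seed(0)

EMBED_DIM = 8  # stand-in for the fixed-size embedding dimension

def encode_image(image) -> list[float]:
    # Placeholder for the CLIP image encoder (hypothetical).
    return [random.random() for _ in range(EMBED_DIM)]

def encode_audio(audio) -> list[float]:
    # Placeholder for the Whisper audio encoder (hypothetical).
    return [random.random() for _ in range(EMBED_DIM)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Illustrative material classes and their (random) text embeddings;
# in the real model these would come from the aligned text space.
MATERIALS = ["metal", "wood", "glass", "plastic"]
CLASS_EMBEDDINGS = {m: [random.random() for _ in range(EMBED_DIM)]
                    for m in MATERIALS}

def predict_material(image, audio) -> str:
    # Fuse the two aligned embeddings (here: simple averaging, an
    # assumption) and pick the class with the highest cosine similarity.
    img, aud = encode_image(image), encode_audio(audio)
    fused = [(x + y) / 2 for x, y in zip(img, aud)]
    return max(MATERIALS, key=lambda m: cosine(fused, CLASS_EMBEDDINGS[m]))

print(predict_material("image.png", "audio.wav"))
```

In the actual model the projection into the shared space is learned (via the LLM alignment), whereas here the encoders and class embeddings are random stubs that only demonstrate the data flow.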