Multimodal Material Estimation

Implementation of a material estimation model using audio-visual cues using CLIP and Whisper to encode the Image and Audio inputs, and an LLM to align them to a fixed size text embedding space which is later used for Material class prediction. The model description is available at Report

To run the training code: python train.py