Multimodal Material Estimation

Implementation of a material estimation model that uses audio-visual cues. CLIP and Whisper encode the image and audio inputs, and an LLM aligns them to a fixed-size text embedding space, which is then used for material class prediction. The model description is available in the Report.
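To make the data flow concrete, here is a minimal, dependency-free sketch of the fusion and classification steps. It is not the repository's code: the embedding sizes, the material labels, the random weights, and every function name are hypothetical, the random vectors stand in for real CLIP and Whisper embeddings, and a single linear projection stands in for the LLM alignment step.

```python
import math
import random

# Hypothetical sizes and labels, chosen only for illustration.
IMAGE_DIM, AUDIO_DIM, FUSED_DIM, HIDDEN_DIM = 8, 6, 4, 5
MATERIALS = ["wood", "metal", "glass"]


def make_linear(rng, in_dim, out_dim):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    """Apply a dense layer: one dot product plus bias per output unit."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    """Numerically stable softmax over a list of logits."""
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def predict_material(image_emb, audio_emb, align, hidden, out):
    """Fuse both embeddings, then classify with a small MLP head."""
    # Stand-in for the alignment step: concatenate and project
    # to a fixed-size vector.
    fused = linear(align, image_emb + audio_emb)
    activations = relu(linear(hidden, fused))
    probs = softmax(linear(out, activations))
    best = max(range(len(probs)), key=probs.__getitem__)
    return MATERIALS[best], probs


rng = random.Random(0)
align = make_linear(rng, IMAGE_DIM + AUDIO_DIM, FUSED_DIM)
hidden = make_linear(rng, FUSED_DIM, HIDDEN_DIM)
out = make_linear(rng, HIDDEN_DIM, len(MATERIALS))

# Random vectors standing in for encoder outputs.
image_emb = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]
audio_emb = [rng.gauss(0, 1) for _ in range(AUDIO_DIM)]

label, probs = predict_material(image_emb, audio_emb, align, hidden, out)
print(label, [round(p, 3) for p in probs])
```

With untrained random weights the predicted label is meaningless; the sketch only shows how two modality embeddings can be reduced to one fixed-size vector before a classification head.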

MLP

To run the training code: `python train.py`