Authors' official PyTorch implementation of EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition. If you use this code for your research, please cite our paper.
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
Niki Maria Foteinopoulou and Ioannis Patras
Abstract: Facial Expression Recognition (FER) is a crucial task in affective computing, but its conventional focus on the seven basic emotions limits its applicability to the complex and expanding emotional spectrum. To address the issue of new and unseen emotions present in dynamic in-the-wild FER, we propose a novel vision-language model that utilises sample-level text descriptions (i.e. captions of the context, expressions or emotional cues), as natural language supervision, aiming to enhance the learning of rich latent representations, for zero-shot classification. To test this, we evaluate using zero-shot classification of the model trained on sample-level descriptions, on four popular dynamic FER datasets. Our findings show that this approach yields significant improvements when compared to baseline methods. Specifically, for zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted Average Recall and 5% in terms of Unweighted Average Recall on several datasets. Furthermore, we evaluate the representations obtained from the network trained using sample-level descriptions on the downstream task of mental health symptom estimation, achieving performance comparable or superior to state-of-the-art methods and strong agreement with human experts. Namely, we achieve a Pearson's Correlation Coefficient of up to 0.85, which is comparable to human experts' agreement.
In a nutshell, we follow the CLIP contrastive training paradigm to jointly optimise a video and a text encoder: the two encoders are trained with a contrastive loss over the cosine similarities of the video-text pairings in the mini-batch.
More specifically, the video encoder combines the CLIP image encoder, which produces frame-wise latent representations, with a Transformer that models the temporal relationships between frames; the text encoder is initialised from the CLIP text encoder.
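For illustration, below is a minimal sketch of such a symmetric contrastive objective in PyTorch. It is not code from this repository: the embedding dimension, batch size, temperature value, and the random tensors standing in for encoder outputs are placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over the cosine similarities of all
    video-text pairings in the mini-batch (matched pairs on the diagonal)."""
    # L2-normalise so the dot product equals cosine similarity
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [B, B] similarity matrix, scaled by the temperature
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: video-to-text and text-to-video
    loss_v = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_v + loss_t) / 2

# Random embeddings standing in for encoder outputs on a batch of 8 pairs
video_emb = torch.randn(8, 512)  # placeholder for E_V(video)
text_emb = torch.randn(8, 512)   # placeholder for E_T(description)
loss = clip_style_contrastive_loss(video_emb, text_emb)
```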
We recommend installing the required packages in a virtual environment created with Python's built-in venv module, as follows:
$ python -m venv venv
$ source venv/bin/activate
(venv) $ pip install --upgrade pip
(venv) $ pip install -r requirements.txt
To use this virtual environment in a Jupyter Notebook, you need to manually register it as a kernel:
(venv) $ python -m ipykernel install --user --name=venv
The weights used for the downstream task (without the FC layer) can be found here.
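As a hedged sketch of how a backbone-only checkpoint like this is typically loaded (the filename and the placeholder module below are assumptions, not part of this repository), passing strict=False to load_state_dict lets PyTorch skip the absent FC-layer parameters:

```python
import torch
import torch.nn as nn

# Placeholder standing in for the repository's video encoder class;
# substitute the actual model definition and checkpoint path.
class VideoBackbone(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.proj(x)

model = VideoBackbone()
state_dict = torch.load("emoclip_backbone.pth", map_location="cpu")  # assumed filename
# strict=False ignores keys that do not match, e.g. the removed FC layer
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```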
This work is supported by EPSRC DTP studentship (No. EP/R513106/1) and EU H2020 AI4Media (No. 951911). This research utilised Queen Mary's Apocrita HPC facility, supported by QMUL Research-IT. http://doi.org/10.5281/zenodo.438045
@inproceedings{foteinopoulou_emoclip_2024,
title = {{EmoCLIP}: {A} {Vision}-{Language} {Method} for {Zero}-{Shot} {Video} {Facial} {Expression} {Recognition}},
author = {Foteinopoulou, Niki Maria and Patras, Ioannis},
year = {2024},
booktitle = {2024 {IEEE} 18th {International} {Conference} on {Automatic} {Face} and {Gesture} {Recognition} ({FG})}
}