
Expanding Cross-Architectural Self-supervision for Multi-modal learning

This repository contains the implementation of the final project for the Deep Learning course at NYU.

Overview

Memes have become an online phenomenon and have grown in circulation with increasing access to the internet and to image-editing applications. This floods the internet with memes at a scale that makes direct human moderation impossible. A plausible solution is to use AI to determine whether a meme is harmful. However, memes often combine visual and textual cues subtly to convey hate speech. Traditional deep learning architectures are unimodal, i.e., they focus only on image or only on textual data; since memes rely on the combination of the two, detection calls for techniques that are multimodal in their approach. Human intelligence, too, relies on multiple modalities. Another bottleneck is the availability of annotations: supervised techniques require input-label pairs to learn meaningful representations, so scaling them is arduous and expensive. This is where self-supervised learning can be used to learn representations without labels. Taking this problem as our starting point, we implement two unimodal and two multimodal self-supervised approaches. In addition, we extend the implementation of CASS to multiple modalities and propose a novel multimodal self-supervised technique, CASS-MM (CASS Multi-Modal).

Methodologies Explored

CLIP

Model Structure

(figure: CLIP model structure)

NumPy-like pseudocode for the core of an implementation of CLIP.

(figure: CLIP pseudocode)

Reference Paper Link: https://arxiv.org/pdf/2103.00020.pdf
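The pseudocode figure above summarizes CLIP's symmetric contrastive objective. As a minimal PyTorch sketch of that objective (an illustration, not the exact code in this repository's notebooks):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss used by CLIP.

    image_emb, text_emb: [batch, dim] projections of a paired image/text
    minibatch; the i-th image is matched with the i-th text.
    """
    # L2-normalise so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix, scaled by the temperature
    logits = image_emb @ text_emb.t() / temperature

    # the matching pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # cross-entropy over images->texts and texts->images, averaged
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```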

DINO

Model Structure

(figure: DINO model structure)

Reference Paper Link: https://arxiv.org/pdf/2104.14294.pdf
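For reference, the sketch below shows the DINO self-distillation loss in PyTorch: the student is trained to match centered, sharpened teacher targets, while the teacher's weights are an exponential moving average of the student's. This is a minimal illustration based on the DINO paper, not the exact code used in this repository.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out: torch.Tensor,
              teacher_out: torch.Tensor,
              center: torch.Tensor,
              student_temp: float = 0.1,
              teacher_temp: float = 0.04) -> torch.Tensor:
    """Cross-entropy between sharpened teacher targets and the student.

    student_out, teacher_out: [batch, K] projector outputs for two augmented
    views of the same image; `center` is a running mean of teacher outputs,
    used to avoid collapse.
    """
    # teacher: centering + sharpening; no gradient flows through it
    teacher_probs = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    # student: plain temperature-scaled log-softmax
    student_logp = F.log_softmax(student_out / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

# The teacher is updated as an exponential moving average of the student:
# teacher_param.data = m * teacher_param.data + (1 - m) * student_param.data
```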

CASS

Model Structure

(figure: CASS model structure)

Reference Paper Link: https://arxiv.org/pdf/2206.04170.pdf
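CASS passes the same image through a CNN and a Transformer and trains the two architectures to agree on its representation. The sketch below is illustrative only: it assumes a Barlow-Twins-style cross-correlation objective between the two branches' projected embeddings, which is not necessarily the exact loss defined in the CASS paper or used in this repository.

```python
import torch

def cross_architecture_loss(cnn_emb: torch.Tensor,
                            vit_emb: torch.Tensor,
                            off_diag_weight: float = 5e-3) -> torch.Tensor:
    """Cross-correlation agreement loss between the embeddings produced by a
    CNN and a Transformer for the same batch of images (illustrative).

    cnn_emb, vit_emb: [batch, dim] projected features from the two branches.
    """
    n = cnn_emb.size(0)
    # standardise each feature dimension across the batch
    cnn_norm = (cnn_emb - cnn_emb.mean(0)) / (cnn_emb.std(0) + 1e-6)
    vit_norm = (vit_emb - vit_emb.mean(0)) / (vit_emb.std(0) + 1e-6)

    # [dim, dim] cross-correlation matrix between the two architectures
    corr = cnn_norm.t() @ vit_norm / n

    # push the diagonal towards 1 (agreement) and the rest towards 0
    on_diag = (torch.diagonal(corr) - 1).pow(2).sum()
    off_diag = (corr - torch.diag(torch.diagonal(corr))).pow(2).sum()
    return on_diag + off_diag_weight * off_diag
```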

Results

Metrics used for Evaluation:

For our experiments, we chose two evaluation metrics: AUROC and accuracy.
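The sketch below shows one way to compute these two metrics with scikit-learn, assuming binary hateful/not-hateful labels and one predicted probability per meme; the notebooks in this repository may compute them differently.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5):
    """Compute the metrics reported in the tables below.

    y_true: binary labels (1 = hateful meme)
    y_prob: predicted probability of the positive class
    """
    auroc = roc_auc_score(y_true, y_prob)                # threshold-free ranking quality
    acc = accuracy_score(y_true, y_prob >= threshold)    # hard-decision accuracy
    return auroc, acc
```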

Unimodal Approach

The table below compares supervised-only ResNet-50 with DINO- and CASS-trained (unimodal) ResNet-50 for two input sizes, 224 and 384. CASS-trained ResNet-50 outperformed both the supervised and DINO-trained ResNet-50 at both image sizes, with an overall improvement of about 2% and 3% over supervised ResNet-50 for image sizes 224 and 384, respectively.

| Model | AUROC | Accuracy |
| --- | --- | --- |
| ResNet-50 224 | 0.5157±0.081 | 0.5175±0.044 |
| ResNet-50 224 (CASS) | 0.5398±0.091 | 0.5435±0.031 |
| ResNet-50 224 (DINO) | 0.5129±0.009 | 0.5267±0.055 |
| ResNet-50 384 | 0.5116±0.076 | 0.511±0.016 |
| ResNet-50 384 (CASS) | 0.5405±0.045 | 0.5475±0.015 |
| ResNet-50 384 (DINO) | 0.532±0.033 | 0.543±0.056 |

The table below compares supervised-only ViT Base 16 with DINO- and CASS-trained (unimodal) ViT Base 16 for input sizes 224 and 384. CASS-trained ViT Base 16 again improves upon the supervised and DINO-trained models, by a small margin at image size 224 and by around 3% on AUROC at image size 384.

| Model | AUROC | Accuracy |
| --- | --- | --- |
| ViT Base 16 224 | 0.51293±0.043 | 0.5235±0.065 |
| ViT Base 16 224 (CASS) | 0.5196±0.051 | 0.5545±0.01 |
| ViT Base 16 224 (DINO) | 0.513±0.077 | 0.5195±0.014 |
| ViT Base 16 384 | 0.5±0.022 | 0.545±0.087 |
| ViT Base 16 384 (CASS) | 0.53386±0.009 | 0.59±0.017 |
| ViT Base 16 384 (DINO) | 0.5002±0.011 | 0.624±0.071 |

Multimodal Approach

The table below compares our novel CASS MM (Multi-Modal) self-supervised technique against CLIP for input image size 384. We chose image size 384 because, in the results above, it performed better than 224.

| Model | AUROC | Accuracy |
| --- | --- | --- |
| CLIP (ViTB/16) | 0.5761±0.011 | 0.5915±0.031 |
| CASS MM (ViTB/16) | 0.5518±0.05 | 0.6335±0.012 |
| CASS MM (ResNet50) | 0.5518±0.05 | 0.5817±0.087 |

Plots and Graphs

The diagram below shows training loss vs. epochs for the CASS multimodal-trained CNN and Transformer over 25 epochs.

(figure: training loss vs. epochs for CASS-MM CNN and Transformer)

Repository Details

Below are the details of the files in this repository.

  • For the code used in pre-training, refer to the Python notebooks in the folders below:
└── CASS // Implementation of CASS with Unimodal configuration
└── CASS_MM  // Implementation of CASS with Multimodal configuration
└── CLIP // Implementation of CLIP
└── DINO  // Implementation of DINO
  • For the code used in fine-tuning, refer to the Python notebook in the folder below:
└── Finetuning 

Code References:

https://github.com/moein-shariatnia/OpenAI-CLIP
