
CompactTransformers

My attempt at re-implementing the Compact Transformer models from the paper 'Escaping the Big Data Paradigm with Compact Transformers' by Ali Hassani et al., in Keras.

Model Architecture

(Figure: Compact Transformer model architecture)

Paper Description

Aim

  • Explore Vision Transformer-based models with a novel design that can be trained on less data
  • The authors propose three architectures, each an improvement on the previous one: ViT-Lite, the Compact Vision Transformer (CVT), and the Compact Convolutional Transformer (CCT)
  • CCT uses convolutions as part of the tokenization step, which introduces an inductive bias so that the tokens preserve more spatial information
  • The authors also introduce a novel sequence-pooling layer that replaces the conventional class-token design of the Vision Transformer (a Keras sketch follows this list)
  • The paper claims better accuracy and computational efficiency than previous transformer-based models, and results comparable with CNN-based models on classification tasks
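
The sequence-pooling idea can be captured in a few lines of Keras. Below is a minimal sketch of such a layer, assuming token embeddings of shape (batch, num_tokens, dim); the class name and shapes are my own illustration, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers


class SequencePooling(layers.Layer):
    """Attention-weighted average over the token sequence, replacing the class token."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # One scalar attention score per token.
        self.score = layers.Dense(1)

    def call(self, tokens):
        # tokens: (batch, num_tokens, dim)
        weights = tf.nn.softmax(self.score(tokens), axis=1)    # (batch, num_tokens, 1)
        pooled = tf.matmul(weights, tokens, transpose_a=True)  # (batch, 1, dim)
        return tf.squeeze(pooled, axis=1)                      # (batch, dim)
```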

Methodology

(Figures: training workflow and transformer encoder)
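
The encoder stage in the figure is a standard transformer encoder. As a reference point, here is a minimal pre-norm encoder block in Keras; the default sizes are illustrative assumptions, not the exact configuration used in the experiments.

```python
from tensorflow.keras import layers


def transformer_encoder_block(tokens, num_heads=4, projection_dim=64,
                              mlp_units=(128,), dropout=0.1):
    """Pre-norm encoder block: self-attention and an MLP, each with a residual connection."""
    # Multi-head self-attention sub-layer.
    x = layers.LayerNormalization(epsilon=1e-6)(tokens)
    x = layers.MultiHeadAttention(num_heads=num_heads, key_dim=projection_dim,
                                  dropout=dropout)(x, x)
    attention_output = layers.Add()([x, tokens])

    # Feed-forward (MLP) sub-layer.
    y = layers.LayerNormalization(epsilon=1e-6)(attention_output)
    for units in mlp_units:
        y = layers.Dense(units, activation="gelu")(y)
        y = layers.Dropout(dropout)(y)
    y = layers.Dense(tokens.shape[-1])(y)  # project back to the token dimension
    return layers.Add()([y, attention_output])
```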

Datasets

Major Components Implemented

  • ViT-Lite: the Vision Transformer architecture with a smaller patch size, fewer transformer layers, and smaller MLP heads
  • Compact Vision Transformer (CVT): the same architecture as ViT-Lite, but with the class token replaced by a sequence-pooling layer
  • Compact Convolutional Transformer (CCT): built on top of CVT; in the tokenization step, instead of sending patches through a linear embedding layer, convolutional feature maps of the image are used as tokens (see the sketch below)
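
For CCT, a hedged sketch of what the convolutional tokenization step could look like in Keras is given below; the kernel and pooling sizes are assumptions for illustration, not the exact configuration from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers


class ConvTokenizer(layers.Layer):
    """Turns an image into a sequence of tokens via a small conv + pooling stack."""

    def __init__(self, embedding_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.conv = layers.Conv2D(embedding_dim, kernel_size=3, padding="same",
                                  activation="relu")
        self.pool = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")

    def call(self, images):
        x = self.pool(self.conv(images))  # (batch, h', w', embedding_dim)
        # Flatten the spatial grid into a token sequence: (batch, h' * w', embedding_dim)
        return tf.reshape(x, (tf.shape(x)[0], -1, x.shape[-1]))
```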

Results

Hyperparameters

  • Optimizer = AdamW
  • learning rate = 0.001
  • weight decay = 0.0001
  • batch size = 256
  • num epochs = 30
  • image size = 32
  • patch size = 6
  • projection dim = 64
  • num heads = 4
  • transformer layers = 8
  • mlp head units = [2048, 1024]
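
For context, the hyperparameters above might be wired up roughly as follows. This is a sketch: AdamW is built into tf.keras.optimizers from TF 2.11 onwards, while older versions would use tfa.optimizers.AdamW from tensorflow_addons instead.

```python
import tensorflow as tf

learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 30

# AdamW with decoupled weight decay, as listed above.
optimizer = tf.keras.optimizers.AdamW(learning_rate=learning_rate,
                                      weight_decay=weight_decay)

# Typical compile/fit usage (model and datasets assumed to be defined elsewhere):
# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds,
#           epochs=num_epochs, batch_size=batch_size)
```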
The training of ViT-Lite took about 120 minutes on Google Colaboratory's TPU and reached a training accuracy of 59.18 with a validation accuracy of 60. CVT took about 100 minutes to train, reaching a training accuracy of 61, a validation accuracy of 60, and a test accuracy of 59.29. CCT took about 200 minutes to train and reached a training accuracy of 84.24, a validation accuracy of 77.64, and a test accuracy of 77.15.

Loss/Accuracy vs. Epochs for CCT

(Plots: training/validation loss and accuracy curves for CCT)

Future Scope

  • Further changes to the attention mechanism could better represent the patches, for example using attention gradient rollout, which ignores regions with low attention.
  • It would be interesting to use involutions instead of convolutions in the tokenization step and see how the performance changes.
  • Different types of loss functions could be used, such as a patch-wise contrastive loss or a patch-wise mixing loss.

Citation Of Original Paper

@article{hassani2021escaping,
	title        = {Escaping the Big Data Paradigm with Compact Transformers},
	author       = {Ali Hassani and Steven Walton and Nikhil Shah and Abulikemu Abuduweili and Jiachen Li and Humphrey Shi},
	year         = 2021,
	url          = {https://arxiv.org/abs/2104.05704},
	eprint       = {2104.05704},
	archiveprefix = {arXiv},
	primaryclass = {cs.CV}
}
