This repository contains a PyTorch implementation of the paper *CvT: Introducing Convolutions to Vision Transformers*.
The CvT model introduces convolutions to the Vision Transformer architecture. The overall architecture is shown below:
The key components of CvT are:
- Convolutional Token Embedding: This module maps the 2D input image (or the 2D-reshaped token map from the previous stage) to a sequence of tokens, similar to ViT, but uses a convolutional layer instead of a linear patch projection. This lets the model learn local spatial context and progressively reduce the token count across stages.
- Convolutional Transformer Block: This block replaces the linear projections in the multi-head attention (MHA) module with depth-wise separable convolutions. This lets attention capture local spatial context, and striding the key/value projections reduces computation.
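The two components above can be sketched in a few lines of PyTorch. This is an illustrative sketch only; the class names, kernel sizes, and strides below are assumptions for a first CvT stage, not this repository's actual code:

```python
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Tokenize with an overlapping convolution instead of ViT's linear patch projection."""
    def __init__(self, in_ch=3, embed_dim=64, kernel=7, stride=4, pad=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel, stride, pad)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, H'*W', D) token sequence
        return self.norm(x), (H, W)

class ConvProjection(nn.Module):
    """Depth-wise separable convolutional projection, as used for Q/K/V in CvT."""
    def __init__(self, dim, kernel=3, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, stride, kernel // 2,
                            groups=dim, bias=False)   # depth-wise
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, 1)              # point-wise
    def forward(self, tokens, hw):             # tokens: (B, N, D)
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)  # tokens back to 2D map
        x = self.pw(self.bn(self.dw(x)))
        return x.flatten(2).transpose(1, 2)    # (B, N', D)

# Usage: embed an image into tokens, then apply a convolutional Q projection.
emb = ConvTokenEmbedding()
q_proj = ConvProjection(64)
img = torch.randn(2, 3, 224, 224)
tokens, (h, w) = emb(img)                      # (2, 56*56, 64) with these defaults
q = q_proj(tokens, (h, w))
```

With stride > 1 in `ConvProjection` for keys and values, the attention map shrinks, which is where the computational savings come from.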
The model architecture details are shown in the paper.
This repository uses the Oxford-IIIT Pet Dataset. The dataset covers 37 breeds of cats and dogs, with roughly 200 images per breed. The images vary widely in scale, pose, and lighting. The dataset is split into training, validation, and test sets.
To use this repository, install the dependencies listed in `pyproject.toml` by running:

```sh
poetry install
```

Then run the `train.ipynb` notebook to train the model.
The learning curve for the CvT-13 model is shown below:
As you can see, the learning curve is not in its most desirable shape. The paper first pre-trains the model on very large datasets and then transfers it to smaller ones, such as the Oxford-IIIT Pet dataset. However, I had to train from scratch on an NVIDIA GPU with 6 GB of VRAM, which made such pre-training impossible. As a result, overfitting was unavoidable, and the loss did not drop below 3.0.