PyTorch Implementation of the Visual Transformer (ViT) from Scratch

Reimplementation of the paper:

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", Dosovitskiy et al., 2020.

If you use the code in this repo and find it useful, please consider giving it a star ⭐!

Usage

# Imports
import torch
from models.vit import ViT

# Create the model
vit = ViT(
    patch_size=4, 
    num_layers=2, 
    h_dim=256, 
    num_heads=8, 
    num_classes=10, 
    d_ff=2048, 
    max_time_steps=1000, 
    use_clf_token=True,
)

# Inference
vit.eval()
x = torch.randn(1, 3, 32, 32) # [B, C, H, W]
logits = vit(x) # [B, N_CL]
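
For the configuration above, the sequence length seen by the transformer follows from the image and patch sizes. The sketch below works through that arithmetic in plain Python. `num_tokens` and `patch_dim` are hypothetical helpers, not part of this repo, and the sketch assumes non-overlapping square patches plus one extra classification token when `use_clf_token=True`, as described in the paper.

```python
def num_tokens(image_size: int, patch_size: int, use_clf_token: bool = True) -> int:
    """Token count for a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # Assumption: the classification token adds exactly one position.
    return num_patches + (1 if use_clf_token else 0)


def patch_dim(patch_size: int, channels: int = 3) -> int:
    """Length of one flattened patch before the linear projection to h_dim."""
    return patch_size * patch_size * channels


print(num_tokens(32, 4))  # 65 = 8 * 8 patches + 1 classification token
print(patch_dim(4))       # 48 = 4 * 4 * 3
```

With a 32x32 input and `patch_size=4`, the model therefore attends over 65 tokens, well below the `max_time_steps=1000` set in the example.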

Model Configurations

From the paper [link]:

Model	Layers	Hidden Size	MLP Size	Heads	Params
ViT-Base	12	768	3072	12	86 M
ViT-Large	24	1024	4096	16	307 M
ViT-Huge	32	1280	5120	16	632 M
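
The parameter counts in the table can be sanity-checked with a rough estimate. The sketch below counts only the weight matrices of the encoder blocks (four attention projections and two MLP layers per block) and ignores biases, layer norms, the patch embedding, position embeddings and the classification head, so it slightly undershoots. `approx_encoder_params` is a hypothetical helper written for illustration, not code from this repo.

```python
def approx_encoder_params(layers: int, hidden: int, mlp: int) -> int:
    """Rough weight count of a transformer encoder (weight matrices only)."""
    attention = 4 * hidden * hidden   # query, key, value and output projections
    feed_forward = 2 * hidden * mlp   # hidden -> mlp and mlp -> hidden
    return layers * (attention + feed_forward)


# (layers, hidden size, MLP size) taken from the table above.
configs = {
    "ViT-Base": (12, 768, 3072),
    "ViT-Large": (24, 1024, 4096),
    "ViT-Huge": (32, 1280, 5120),
}

for name, (layers, hidden, mlp) in configs.items():
    millions = approx_encoder_params(layers, hidden, mlp) / 1e6
    print(f"{name}: ~{millions:.0f} M")
```

The estimate gives roughly 85 M, 302 M and 629 M, within about 2% of the reported 86 M, 307 M and 632 M. The number of heads does not appear in the formula because splitting the hidden dimension across heads leaves the projection sizes unchanged.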

Train

$ python main.py \
    --mode "train" \
    --model "vit-base" \
    --patch_size 8 \
    --lr 3e-4 \
    --epochs 100

Test

$ python main.py \
    --mode "test" \
    --model "vit-base" \
    --patch_size 8 \
    --model_checkpoint "./checkpoints/vit_base.ckpt"
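
As an illustration of how the flags used in the Train and Test commands could be parsed, here is a minimal `argparse` sketch. It is not the actual content of `main.py`: `build_parser` is a hypothetical name, and the choices and default values are assumptions copied from the example commands above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Train or test a ViT model (sketch).")
    parser.add_argument("--mode", choices=["train", "test"], default="train")
    parser.add_argument("--model", type=str, default="vit-base")
    parser.add_argument("--patch_size", type=int, default=8)
    # Assumed defaults, taken from the training example.
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--epochs", type=int, default=100)
    # Only needed when evaluating a saved model.
    parser.add_argument("--model_checkpoint", type=str, default=None)
    return parser


# Parse the arguments of the test command shown above.
args = build_parser().parse_args([
    "--mode", "test",
    "--model", "vit-base",
    "--patch_size", "8",
    "--model_checkpoint", "./checkpoints/vit_base.ckpt",
])
print(args.mode, args.model, args.patch_size, args.model_checkpoint)
```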

guglielmocamporese/visual-transformer-pytorch
