Introduction

MyViT is a simplified version of rwightman/pytorch-image-models/timm/models/vision_transformer

This project aims to make it easy to review the code side by side with the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".

Equations

Transformer Encoder

$$\begin{aligned}
(H, W) &= \text{the resolution of the original image}\\
C &= \text{the number of channels}\\
(P, P) &= \text{the resolution of each image patch}\\
D &= \text{latent vector size}\\
N' &= H \cdot W / P^2 = \text{the number of patches}\\
N &= N' + 1 = \text{the Transformer’s sequence length}\\
\\
\mathrm{LN} &= \text{LayerNorm}\\
\\
&\textbf{Input}\\
\mathbf{x}_{p} &\in \mathbb{R}^{N' \times (P^2 \cdot C)}\\
\\
&\textbf{Learnable}\\
\mathbf{E} &\in \mathbb{R}^{(P^2 \cdot C) \times D}\\
\mathbf{E}_{pos} &\in \mathbb{R}^{N \times D}\\
\mathbf{x}_{class} &\in \mathbb{R}^{1 \times D}\\
\\
\mathbf{z}_{0} &= [\mathbf{x}_{class}\ ;\ \mathbf{x}_{p}\mathbf{E}]+\mathbf{E}_{pos} &\mathbf{z}_{0} &\in \mathbb{R}^{N \times D}\\
\\
\mathbf{z'}_{l} &= \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{l-1}))+\mathbf{z}_{l-1} &\mathbf{z'}_{l} &\in \mathbb{R}^{N \times D}\\
\mathbf{z}_{l} &= \mathrm{MLP}(\mathrm{LN}(\mathbf{z'}_{l}))+\mathbf{z'}_{l} &\mathbf{z}_{l} &\in \mathbb{R}^{N \times D}\\
&\text{where} \quad l = 1 \ldots L\\
\\
&\textbf{Output}\\
\mathbf{y} &= \mathrm{LN}(\mathbf{z}^{0}_{L}) &\mathbf{y} &\in \mathbb{R}^{D}
\end{aligned}$$
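The construction of $\mathbf{z}_{0}$ can be checked with a small shape-level sketch. It is written in NumPy rather than PyTorch to keep it dependency-light, and it is not the code from this repository: the function name `embed_patches`, the tiny sizes, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def embed_patches(image, P, E, E_pos, x_class):
    """Build z_0 = [x_class ; x_p E] + E_pos from an (H, W, C) image."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by P"
    # Cut the image into non-overlapping P x P patches and flatten each one:
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N', P*P*C)
    x_p = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    x_p = x_p.reshape(-1, P * P * C)
    # Project patches to D dimensions and prepend the class token: (N, D)
    tokens = np.concatenate([x_class, x_p @ E], axis=0)
    return tokens + E_pos

# Hypothetical toy sizes; a real ViT-Base would use H = W = 224, P = 16, D = 768.
rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 16, 8
n_patches = H * W // P**2   # N'
seq_len = n_patches + 1     # N
z0 = embed_patches(
    rng.normal(size=(H, W, C)),
    P,
    E=rng.normal(size=(P * P * C, D)),
    E_pos=rng.normal(size=(seq_len, D)),
    x_class=rng.normal(size=(1, D)),
)
print(z0.shape)  # (5, 8)
```

With a 32x32 image and 16x16 patches there are $N' = 4$ patches, so the sequence length is $N = 5$ once the class token is prepended.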

MSA (Multihead Self Attention)

$$\begin{aligned}
h &= \text{number of heads}\\
d &= D / h\\
\\
&\textbf{Input}\\
\mathbf{z} &\in \mathbb{R}^{N \times D}\\
\\
&\textbf{Learnable}\\
\mathbf{U}_{qkv} &\in \mathbb{R}^{D \times (3 \cdot d)}\\
\mathbf{U}_{msa} &\in \mathbb{R}^{D \times D}\\
\\
[\mathbf{q, k, v}] &= \mathbf{zU}_{qkv} &\mathbf{q, k, v} &\in \mathbb{R}^{N \times d}\\
\\
A &= \mathrm{softmax}(\ \mathbf{qk}^{\top}\ /\ \sqrt{d}\ ) &A &\in \mathbb{R}^{N \times N}\\
\\
\mathrm{SA}(\mathbf{z}) &= A\mathbf{v} &\mathrm{SA}(\mathbf{z}) &\in \mathbb{R}^{N \times d}\\
\\
&\textbf{Output}\\
\mathrm{MSA}(\mathbf{z}) &= [\mathrm{SA}_{1}(\mathbf{z}) ; \mathrm{SA}_{2}(\mathbf{z}) ; \cdots ; \mathrm{SA}_{h}(\mathbf{z})] \mathbf{U}_{msa} &\mathrm{MSA}(\mathbf{z}) &\in \mathbb{R}^{N \times D}
\end{aligned}$$
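The attention equations map almost line for line onto array operations. The following is a minimal NumPy sketch, not this repository's implementation: the helper names (`softmax`, `self_attention`, `msa`), the per-head list of `U_qkv` matrices, and the toy sizes are assumptions chosen to mirror the notation above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, U_qkv):
    """One head: z is (N, D), U_qkv is (D, 3d). Returns SA(z) of shape (N, d) and A of shape (N, N)."""
    q, k, v = np.split(z @ U_qkv, 3, axis=-1)
    d = q.shape[-1]
    A = softmax(q @ k.T / np.sqrt(d))
    return A @ v, A

def msa(z, U_qkv_heads, U_msa):
    """Concatenate the h head outputs along the feature axis, then project with U_msa."""
    heads = [self_attention(z, U)[0] for U in U_qkv_heads]
    return np.concatenate(heads, axis=-1) @ U_msa

# Hypothetical toy sizes with random weights.
rng = np.random.default_rng(0)
N, D, h = 5, 8, 2
d = D // h
z = rng.normal(size=(N, D))
U_qkv_heads = [rng.normal(size=(D, 3 * d)) for _ in range(h)]
U_msa = rng.normal(size=(D, D))
out = msa(z, U_qkv_heads, U_msa)
_, A = self_attention(z, U_qkv_heads[0])
print(out.shape, A.shape)  # (5, 8) (5, 5)
```

Each row of `A` sums to 1, since the softmax is taken over the key axis. Implementations often fuse the per-head projections into one $D \times 3D$ matrix; the per-head list here is only for readability.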

MLP (Multilayer Perceptron)

$$\begin{aligned}
D_{hidden} &= \text{hidden layer size}\\
\\
&\textbf{Input}\\
\mathbf{z} &\in \mathbb{R}^{N \times D}\\
\\
&\textbf{Learnable}\\
\mathbf{L}_{hidden} &\in \mathbb{R}^{D \times D_{hidden}}\\
\mathbf{L}_{out} &\in \mathbb{R}^{D_{hidden} \times D}\\
\\
&\textbf{Output}\\
\mathrm{MLP}(\mathbf{z}) &= \mathrm{GELU}(\mathbf{zL}_{hidden})\mathbf{L}_{out} &\mathrm{MLP}(\mathbf{z}) &\in \mathbb{R}^{N \times D}
\end{aligned}$$
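As a rough illustration of the MLP equation, the sketch below applies the two matrices with a GELU in between. It is a NumPy sketch under stated assumptions rather than this repository's code: bias terms are omitted to match the equations, `gelu` and `mlp` are hypothetical helper names, and $D_{hidden} = 4D$ is only an example choice.

```python
import math
import numpy as np

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def mlp(z, L_hidden, L_out):
    """MLP(z) = GELU(z L_hidden) L_out, with bias terms omitted as in the equations."""
    return gelu(z @ L_hidden) @ L_out

# Hypothetical toy sizes with random weights.
rng = np.random.default_rng(0)
N, D = 5, 8
D_hidden = 4 * D
z = rng.normal(size=(N, D))
out = mlp(z, rng.normal(size=(D, D_hidden)), rng.normal(size=(D_hidden, D)))
print(out.shape)  # (5, 8)
```

The hidden layer widens each token from $D$ to $D_{hidden}$ features and the output layer maps it back, so the result keeps the $N \times D$ shape needed for the residual connection.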