Calculate the number of params in the model:
- h = hidden size
- l = num_layers
- s = sequence length
- v = vocabulary size
$ python -c "h=1024; l=24; s=1024; v=50257; print(f'{l*(12*h**2 + 13*h) + v*h + s*h + 2*h >> 20}M')"
338M
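As a sanity check, plugging GPT-2 small's config into the same formula recovers the commonly cited 124M parameter count:

$ python -c "h=768; l=12; s=1024; v=50257; print(f'{l*(12*h**2 + 13*h) + v*h + s*h + 2*h:,}')"
124,439,808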
For our scripts, where we only care about billions:
NHIDDEN=4096
NLAYERS=36
SEQ_LEN=512
VOCAB_SIZE=50257
python -c "h=$NHIDDEN; l=$NLAYERS; s=$SEQ_LEN; v=$VOCAB_SIZE; print(f'Model size: {(l*(12*h**2 + 13*h) + v*h + s*h + 2*h) / 10**9 :.0f}B')"
Full math behind the final formula above (the number of attention heads cancels out and so doesn't appear in it):
# Compute Embedding Parameters (Vocab + Position)
emb_params = (v * h) + (s * h)

# Compute Parameters per Transformer Block (k = number of attention heads)
head_dim = h / k
qkv_params_w = k * (3 * (h * (h / k))) = 3 * h * h # 3h^2
mh_reduce_w  = (k * (h / k)) * h       = h * h     # h^2
qkv_params_b = k * (3 * (h / k))       = 3 * h     # 3h
mh_reduce_b  = h                                   # h
pos_ff_exp_w = h * (4 * h)                         # 4h^2
pos_ff_con_w = (4 * h) * h                         # 4h^2
pos_ff_exp_b = 4 * h                               # 4h
pos_ff_con_b = h                                   # h
layer_norm1  = 2 * h                               # 2h
layer_norm2  = 2 * h                               # 2h

# Per block this sums to 12h^2 + 13h; the trailing 2h is the final layer norm
# Magic Formula:
total_params = l * (12h^2 + 13h) + (v * h) + (s * h) + 2*h
credits: Sidd Karamcheti
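The same derivation as a small self-contained helper (a sketch: transformer_params and the k=16 default are illustrative, not from any library), summing the per-component terms so you can see where 12h^2 + 13h comes from:

def transformer_params(h, l, s, v, k=16):
    # h=hidden size, l=num layers, s=seq length, v=vocab size, k=attention heads
    emb = v * h + s * h                            # token + position embeddings
    qkv = k * 3 * h * (h // k) + k * 3 * (h // k)  # QKV weights + biases: 3h^2 + 3h
    proj = (k * (h // k)) * h + h                  # attention output proj: h^2 + h
    ffn = h * 4 * h + 4 * h + 4 * h * h + h        # two FF layers + biases: 8h^2 + 5h
    ln = 2 * h + 2 * h                             # two layer norms per block: 4h
    block = qkv + proj + ffn + ln                  # = 12h^2 + 13h; k has cancelled out
    return l * block + emb + 2 * h                 # + final layer norm

assert transformer_params(h=1024, l=24, s=1024, v=50257) == 354_823_168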
An approximate variation of this for models with a large hidden size and many layers (the seq and vocab terms contribute very little):
NHIDDEN=4096
NLAYERS=36
python -c "h=$NHIDDEN; l=$NLAYERS; print(f'Model size: {(12*l*h**2) / 10**9 :.0f}B')"
credits: Mohammad Shoeybi
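For the example config above the shortcut stays within ~3% of the exact count; a quick check comparing the two:

python -c "h=4096; l=36; s=512; v=50257; exact=l*(12*h**2 + 13*h) + v*h + s*h + 2*h; print(f'{1 - (12*l*h**2)/exact:.1%}')"
2.8%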
The same can be calculated on a given model object (counting shared params only once):
sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
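A minimal sketch of why the dedup by data_ptr() matters (TiedLM is a made-up toy module): two distinct Parameter objects sharing one storage, as with tied input/output embeddings, are double counted by a naive numel sum but counted once by the snippet above:

import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, v=50257, h=1024):
        super().__init__()
        self.emb = nn.Embedding(v, h)
        self.head = nn.Linear(h, v, bias=False)
        # tie the output head to the embedding storage via a distinct Parameter
        self.head.weight = nn.Parameter(self.emb.weight.data)

model = TiedLM()
naive = sum(p.numel() for p in model.parameters())
deduped = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())
print(f'{naive:,} vs {deduped:,}')  # 102,926,336 vs 51,463,168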