Replies: 1 comment
-
einops.einsum is merely a facade over torch.einsum. AFAIR, torch.einsum currently includes opt_einsum and, I assume, optimizes the order of execution by default. If there are any problems in your code with memory allocation, they almost certainly happen in the last einsum. I'd recommend
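Since the reply points at the contraction order chosen by the optimizer, one way to investigate is to inspect the planned path and the size of the largest intermediate tensor. A minimal sketch using numpy's analogue of this machinery (the shapes and subscripts are hypothetical stand-ins; the original MoE code is not shown in the post):

```python
import numpy as np

# Hypothetical shapes standing in for the MoE tensors in the report.
b, k, d, h = 32, 8, 64, 128
x = np.random.rand(b, d)       # tokens
w1 = np.random.rand(k, d, h)   # per-expert weights
g = np.random.rand(b, k)       # gating scores

# einsum_path reports the contraction order and the largest intermediate,
# which is where an "attempted to allocate N GiB" error would originate.
path, info = np.einsum_path('bd,kdh,bk->bh', x, w1, g, optimize='optimal')
print(info)  # the report includes a "Largest intermediate" line
```

The printed report shows whether a huge intermediate is planned before the final contraction, which would explain an allocation failure in the last einsum.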
-
I am working on implementing SMoE for Mixtral and have found the following bug. When performing einops.einsum with multiple tensors at once, the code fails at large batch sizes (w=9*K and above), raising an error that it attempts to allocate 1008 GiB of memory. This scales linearly, so w=18*K gives 2016 GiB. However, values of w=8*K and below execute properly, even though they "should" be trying to allocate equally unreasonable amounts of memory. When I implement the matrix operations separately, the code executes without memory errors, even with very large values like w=96*K. Could the multiple-tensor memory / arrangement algorithm be improved to solve this error?
Cheers.
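The workaround described above, decomposing one multi-operand einsum into explicit pairwise steps, can be sketched as follows. The shapes and subscripts are assumptions for illustration (the original Mixtral SMoE code is not included in the post); the point is that the pairwise form bounds the intermediate to a tensor you choose explicitly, rather than whatever the path optimizer picks:

```python
import numpy as np

# Assumed shapes; small here so the sketch runs quickly.
b, k, d, h = 16, 8, 32, 64
x = np.random.rand(b, d)       # tokens
w1 = np.random.rand(k, d, h)   # per-expert weights
g = np.random.rand(b, k)       # gating scores

# One fused three-operand einsum.
fused = np.einsum('bd,kdh,bk->bh', x, w1, g)

# The same contraction as two pairwise steps, with the intermediate
# limited to the explicitly chosen (b, k, h) tensor.
per_expert = np.einsum('bd,kdh->bkh', x, w1)
stepwise = np.einsum('bkh,bk->bh', per_expert, g)

assert np.allclose(fused, stepwise)
```

Both forms compute the same result; only the size of the intermediate buffers differs, which is consistent with the observation that the separate matrix operations avoid the allocation error.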