[llama] Use horizontal fusion trick from Attention for FeedForward #606
How would you account for
Something like this could work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedOperation(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.split_sizes = [input_dim, input_dim, output_dim]
        # w1, w3 and a transposed w2 all packed into a single parameter
        self.fused_weight = nn.Parameter(torch.randn(hidden_dim, 2 * input_dim + output_dim))
        self.bias = nn.Parameter(torch.zeros(output_dim))

    def forward(self, x):
        # Split the fused weight back into its three parts
        w1, w3, w2_t = self.fused_weight.split(self.split_sizes, dim=1)
        # Compute the fused operation: silu(x @ w1.T) * (x @ w3.T), then down-project through w2
        hidden = F.silu(x @ w1.t()) * (x @ w3.t())
        return hidden @ w2_t + self.bias
```

Or do you have a simpler alternative in mind?
@sayakpaul Oh, I mean just fusing w1 and w3, not all of w1, w2 and w3. Similar to how wqkv fuses wq, wk and wv but leaves the output projection (wo) alone.

So, more specifically: right now the FeedForward forward pass causes 3 calls to an nn.Linear, but with the above change it's 2 calls. Essentially you stack w1 and w3 horizontally into a single weight instead of keeping them as two separate weights, and you can split the result of the former back into the two halves (and do so without causing a copy, because of striding). See the sketch below.
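A minimal sketch of what this fused FeedForward could look like (class and argument names are illustrative, not the actual torchao module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedFeedForward(nn.Module):
    """SwiGLU FFN with w1 and w3 fused into one Linear, analogous to wqkv in Attention."""

    def __init__(self, dim: int, hidden_dim: int) -> None:
        super().__init__()
        # One GEMM produces both the gate projection (w1) and the up projection (w3).
        self.w13 = nn.Linear(dim, 2 * hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single matmul, then split the result; chunk returns strided views, no copy.
        x1, x3 = self.w13(x).chunk(2, dim=-1)
        return self.w2(F.silu(x1) * x3)

# Example with llama-7B-like sizes
ffn = FusedFeedForward(dim=4096, hidden_dim=11008)
y = ffn(torch.randn(2, 16, 4096))   # -> shape (2, 16, 4096)
```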
For the Attention module we can concatenate the weights and do one GEMM instead of three for the input projections to gain a speedup, because each of those GEMMs is applied to the same input.
ao/torchao/_models/llama/model.py
Lines 220 to 225 in 22d6f97
and
ao/torchao/_models/llama/model.py
Lines 230 to 231 in 22d6f97
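For reference, the pattern at those lines is roughly the following (a simplified sketch assuming no grouped-query attention, so q, k and v all have the same size; not a verbatim copy of model.py):

```python
import torch
import torch.nn as nn

dim, n_head, head_dim = 4096, 32, 128      # illustrative llama-7B-like sizes
total_head_dim = 3 * n_head * head_dim     # q, k and v stacked along the output dimension

wqkv = nn.Linear(dim, total_head_dim, bias=False)  # one GEMM instead of three
x = torch.randn(2, 16, dim)

# Single matmul over the shared input, then split back into q, k, v.
# split returns strided views of the fused output, so no copy is made.
q, k, v = wqkv(x).split(n_head * head_dim, dim=-1)
```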
I suspect we can do the exact same thing for FeedForward:
ao/torchao/_models/llama/model.py
Lines 262 to 263 in 22d6f97
Task:
Implement the above trick and rerun the benchmarks to show gains. If you don't have access to an A100, another (ideally similar) GPU is fine too as a proxy. Also, if you can, try to confirm via a trace that indeed two GEMMs have been turned into one.
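One way to capture such a trace (just a suggestion, not something the issue prescribes) is torch.profiler, comparing the number of GEMM kernels per FeedForward call before and after the change:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# `FusedFeedForward` is the illustrative module sketched earlier in this thread.
model = FusedFeedForward(dim=4096, hidden_dim=11008).cuda()
x = torch.randn(8, 128, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# The kernel table and the exported Chrome trace should show one fewer GEMM per FFN call.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("ffn_trace.json")
```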