work in progress about fusion layer #2

kwonjihun-theori · 2024-04-30T13:35:47Z

I attempted to implement the fusion layer. Although the implementation looks plausible, the model generates only meaningless tokens, and the logits returned by model(**inputs) are all NaN.
As far as I know, memory usage should increase a little after fusing, but I haven't observed any increase either.

Something is clearly wrong, but it may be a minor issue, like the one I ran into when I implemented quantization earlier. The work is still incomplete, but I thought it might be helpful, so I'm submitting a PR.

I'm not sure if this process is possible, but I have enough hardware to conduct tests. If you modify the code, I can test it.

I would also like to contribute to this AutoAWQ project.
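
For the all-NaN logits described above, one way to narrow things down is to find the first module whose output stops being finite. The following is a minimal sketch in plain Python, not part of this PR: the helper name and the layer names are hypothetical, and it assumes each activation has already been flattened to a list of floats (for example with tensor.flatten().tolist() inside a forward hook).

```python
import math

def first_non_finite(named_outputs):
    """Return the name of the first entry containing NaN or inf, else None.

    named_outputs: ordered iterable of (name, flat list of floats),
    listed in the order the modules run during the forward pass.
    """
    for name, values in named_outputs:
        if any(not math.isfinite(v) for v in values):
            return name
    return None

# Hypothetical per-module activations, in forward order.
trace = [
    ("embed", [0.1, -0.3]),
    ("layers.0.attn", [0.2, float("nan")]),
    ("layers.0.mlp", [float("nan"), float("nan")]),
]
print(first_non_finite(trace))
```

Everything after the first non-finite module is usually contaminated, so only the earliest hit is informative.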

TechxGenus · 2024-04-30T16:44:41Z

Thanks for the commit. The framework is correct and should work after some modifications.

TechxGenus · 2024-04-30T16:53:24Z

awq/modules/fused/block.py

+            qkv_layer,
+            o_proj,
+            dev=dev,
+            max_seq_len=max_seq_len,


Its RoPE implementation is different from Llama's. Some modifications in attn.py are required to get everything working properly.

OK, the other edits are done.

I will edit attn.py!

I think this task is too difficult for me.

First, I checked how the rotary position embedding implementation differs between Cohere and Llama.

Llama builds the embedding with torch.cat after computing freqs, while Cohere uses torch.repeat_interleave.

Also, in rotate_half, Llama uses
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]

while Cohere uses
x1 = x[..., ::2]
x2 = x[..., 1::2]

However, I'm not sure how this should work with RoPE in AWQ.

Can I get any ideas or hints?
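
To make the difference concrete, here is a small sketch of both conventions on plain Python lists standing in for tensors. It is an illustration under stated assumptions, not the code in attn.py: the function names, the base of 10000, and the helper layouts are all chosen for this example. The half-split version pairs dimension i with i + d/2 (angles laid out like torch.cat), and the interleaved version pairs dimension 2i with 2i + 1 (angles laid out like torch.repeat_interleave).

```python
import math

def angles(pos, dim, base=10000.0):
    """One rotation angle per pair of dimensions: pos * base^(-2i/dim)."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rope_half_split(x, pos):
    """Llama-style: x1 is the first half, x2 the second half."""
    half = len(x) // 2
    theta = angles(pos, len(x)) * 2                # [t0, t1, ..., t0, t1, ...]
    rotated = [-v for v in x[half:]] + x[:half]    # rotate_half -> (-x2, x1)
    return [xi * math.cos(t) + ri * math.sin(t)
            for xi, ri, t in zip(x, rotated, theta)]

def rope_interleaved(x, pos):
    """Cohere-style: x1 is the even dims, x2 the odd dims."""
    theta = [t for t in angles(pos, len(x)) for _ in range(2)]  # [t0, t0, t1, t1, ...]
    rotated = []
    for x1, x2 in zip(x[::2], x[1::2]):
        rotated += [-x2, x1]                       # interleave (-x2, x1) pairwise
    return [xi * math.cos(t) + ri * math.sin(t)
            for xi, ri, t in zip(x, rotated, theta)]

def to_half_layout(x):
    """Move even dims to the first half and odd dims to the second half."""
    return x[::2] + x[1::2]

def to_interleaved_layout(y):
    """Inverse of to_half_layout."""
    half = len(y) // 2
    out = []
    for a, b in zip(y[:half], y[half:]):
        out += [a, b]
    return out

x = [1.0, 2.0, 3.0, 4.0]
direct = rope_interleaved(x, pos=3)
via_half = to_interleaved_layout(rope_half_split(to_half_layout(x), pos=3))
print(all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(direct, via_half)))
```

The two conventions apply the same rotations and differ only by a fixed permutation of the head dimensions. That suggests two possible routes: add an interleaved code path to the fused attention, or keep the half-split path and permute the query and key channels consistently. Which of these fits the fused kernels here is not settled by this sketch.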

Aha, I think using a staggered RoPE should be fine. I now have a device to modify it on.

TechxGenus · 2024-04-30T16:53:30Z

awq/models/cohere.py

+                module.self_attn.k_proj,
+                module.self_attn.v_proj,
+            )
+            norm_1 = FasterTransformerRMSNorm(


It uses a normal LayerNorm.

norm_1 = module.input_layernorm
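
Why the swap matters can be shown with a small sketch. It assumes the model's layer norm subtracts the mean and has no bias, which is an assumption of this example rather than something stated in the diff; the function names and the eps value are also illustrative. RMSNorm skips the mean subtraction, so the two only agree on inputs that already have zero mean.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """RMSNorm: scale by the root mean square, no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]

def layer_norm_no_bias(x, weight, eps=1e-5):
    """LayerNorm without bias: subtract the mean, then scale by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = 1.0 / math.sqrt(var + eps)
    return [w * (v - mean) * scale for v, w in zip(x, weight)]

weight = [1.0, 1.0, 1.0, 1.0]
shifted = [1.0, 2.0, 3.0, 4.0]      # non-zero mean: the two norms disagree
centered = [-1.0, 1.0, -2.0, 2.0]   # zero mean: the two norms agree

print(rms_norm(shifted, weight))
print(layer_norm_no_bias(shifted, weight))
```

Using an RMSNorm kernel with LayerNorm weights would therefore change every hidden state whose mean is not zero, so reusing the module's own input_layernorm avoids that mismatch.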

TechxGenus · 2024-04-30T16:53:43Z

awq/modules/fused/block.py

+        )
+
+        h = hidden_states.to(attn_output.device) + attn_output
+        out = h + self.mlp.forward(h)


out = h + self.mlp.forward(norm_out)
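
The suggested change makes the MLP read the normalized input rather than the post-attention sum h. A toy sketch of the two data flows, using plain lists and stand-in callables (all names here are hypothetical and nothing below is the block.py code itself):

```python
def parallel_block(hidden, norm, attn, mlp):
    """Attention and MLP both consume the same normalized input."""
    norm_out = norm(hidden)
    h = [x + a for x, a in zip(hidden, attn(norm_out))]   # residual + attention
    return [x + m for x, m in zip(h, mlp(norm_out))]      # MLP reads norm_out

def sequential_block(hidden, norm, attn, mlp):
    """The variant in the diff: the MLP consumes h instead of norm_out."""
    norm_out = norm(hidden)
    h = [x + a for x, a in zip(hidden, attn(norm_out))]
    return [x + m for x, m in zip(h, mlp(h))]

# Stand-in sublayers so the difference is easy to see by hand.
identity = lambda v: list(v)
double = lambda v: [2.0 * x for x in v]
times_ten = lambda v: [10.0 * x for x in v]

hidden = [1.0, -1.0]
print(parallel_block(hidden, identity, double, times_ten))
print(sequential_block(hidden, identity, double, times_ten))
```

With these stand-ins the two variants give different outputs, which is the kind of silent mismatch that produces meaningless tokens without raising an error.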

TechxGenus · 2024-05-23T06:56:08Z

Thanks!

work in progress about fusion layer

2f96aa5

TechxGenus reviewed Apr 30, 2024

edit layernorm

0fac2a1

TechxGenus merged commit d9f1d18 into TechxGenus:add_cohere_support May 23, 2024

TechxGenus Apr 30, 2024

kwonjihun-theori May 1, 2024

kwonjihun-theori May 6, 2024

TechxGenus May 23, 2024
