
Add noop detach for Nf4 tensor and enhance nf4 testing #40

Merged 1 commit into main on Mar 5, 2024

Conversation

rohan-varma (Member) commented Mar 1, 2024

  • Adds preliminary torch dispatch support, as prototyped by @drisspg, and a no-op detach so that NF4Tensor can be registered as an nn.Parameter (see the sketch after this list).
  • Enhances NF4Tensor testing in torchao.
  • Slight error message enhancement in nf4tensor.
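Roughly, the no-op detach has the following shape (a sketch of the dispatch-table pattern rather than the exact code in this diff; NF4_OPS_TABLE and implements are illustrative names):

import torch

NF4_OPS_TABLE = {}

def implements(aten_ops):
    # register a handler for one or more aten ops
    def decorator(func):
        for op in aten_ops:
            NF4_OPS_TABLE[op] = func
        return func
    return decorator

@implements([torch.ops.aten.detach.default])
def noop_detach(func, args, kwargs):
    # nn.Parameter(...) calls detach on the incoming tensor; returning the
    # NF4Tensor itself keeps the subclass (and its quantized payload) intact.
    return args[0]

# Inside NF4Tensor, the dispatch hook then consults the table:
# @classmethod
# def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
#     if func in NF4_OPS_TABLE:
#         return NF4_OPS_TABLE[func](func, args, kwargs)
#     raise NotImplementedError(f"{func} is not yet supported for NF4Tensor")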

Note: things like state_dict save/load, .parameters() returning the expected data, etc. are not addressed and are out of scope for this PR. We will need to ensure all of these work robustly as part of using this in torchtune, as we'll need to load base model parameters into this layer before quantizing (or quantize on the fly).

Not sure if this is covered out of the box by CI, but tested locally with python test/modules/test_nf4_linear.py -v

@rohan-varma rohan-varma requested a review from drisspg March 1, 2024 01:13
@facebook-github-bot facebook-github-bot added the CLA Signed label Mar 1, 2024
from torchao.dtypes.nf4tensor import NF4Tensor, linear_nf4


class FrozenNF4Linear(nn.Linear):
Contributor:

Just out of curiosity: Why can't this be done by overwriting the weight of an nn.Linear layer?

As in like here

https://github.com/pytorch-labs/segment-anything-fast/blob/387488bc4c7ab2ae311fb0632b34cab5cbfbab78/segment_anything_fast/sparse.py#L28-L32

def apply_sparse(model):
    apply_fake_sparsity(model)
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))

Contributor:

I think the answer is because we don't define the nn.Linear ourselves (see the comment above), but if it works then I agree we should

Member Author (rohan-varma):

For TorchTune use cases, we may not want to overwrite every single linear layer in the model with a frozen NF4 linear. Basically we wanna offer maximum flexibility to users: even if in QLoRA currently every base linear is overwritten, they might wanna play around with this (for example, making a more granular tradeoff by only quantizing the qkv projections and not the feed forwards; see the sketch below).

Also, this won't work easily because our LoRA adapters are nn.Linears themselves, and we would not want to overwrite the LoRA adapters.

cc @ebsmothers for thoughts on UX as well
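A hypothetical sketch of that more granular tradeoff (the "q_proj"/"k_proj"/"v_proj" name filter is just an illustrative assumption, not torchtune's actual API):

import torch
from torchao.dtypes.nf4tensor import NF4Tensor

def quantize_qkv_only(model):
    # only quantize the attention projections, leave feed-forward linears alone
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear) and any(
            proj in name for proj in ("q_proj", "k_proj", "v_proj")
        ):
            module.weight = torch.nn.Parameter(
                NF4Tensor.from_tensor(module.weight.data), requires_grad=False
            )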

Contributor:

Can you link your existing LoRALinears? I imagine the best UX for users/torchtune would be what Christian says, and if you say "I want to qloraify the q projections" you would have a util that swaps the LoRALinear non-adapter weight for an nf4 tensor, and that should be all that is needed

Member Author (rohan-varma):

@drisspg Thanks for the suggestion! Here is the LoRALinear in torchtune: https://github.com/pytorch-labs/torchtune/blob/2fba15d18d35383f7b8ad4dac5369ca6646ae68e/torchtune/modules/peft/lora.py#L37

I'd imagine such a UX would be -

def swap_nf4(llama):
    for module in llama.modules():
        if isinstance(module, LoRALinear):
            # quantize the frozen base weight to NF4
            module.weight = torch.nn.Parameter(
                NF4Tensor.from_tensor(module.weight.data), requires_grad=False
            )

Although, as opposed to module / parameter swapping, torchtune prefers to use a componentized builder approach where we build up models such as llama by plugging in the right nn.Module components depending on the config - nn.Linear for regular llama, LoRALinear for LoRA, and now NF4Linear for QLoRA. See an example of the builder pattern here: https://github.com/pytorch-labs/torchtune/blob/main/torchtune/models/llama2/_lora_llama2_builders.py#L135

Member Author (rohan-varma):

Another point is that IMO NF4Linear eliminates a lot of complexity around state_dict save/load. When calling load_state_dict, I'm not sure whether there will be issues loading into a class that uses NF4 tensors. But if we have a specific NF4Linear that uses these NF4Tensors, we can attach load pre and post hooks to upcast / downcast the tensors appropriately.
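A rough sketch of what those hooks could look like (this wiring is an assumption about the eventual design, using nn.Module's existing _register_state_dict_hook / _register_load_state_dict_pre_hook mechanisms; NF4Linear here is hypothetical):

import torch
from torch import nn
from torchao.dtypes.nf4tensor import NF4Tensor

class NF4Linear(nn.Linear):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._register_state_dict_hook(self._upcast_on_save)
        self._register_load_state_dict_pre_hook(self._quantize_on_load)

    @staticmethod
    def _upcast_on_save(module, state_dict, prefix, local_metadata):
        # dequantize before the weight leaves the module, so checkpoints stay bf16
        key = prefix + "weight"
        if hasattr(state_dict[key], "get_original_weight"):
            state_dict[key] = state_dict[key].get_original_weight()
        return state_dict

    def _quantize_on_load(self, state_dict, prefix, *args):
        # quantize incoming full-precision weights on the way in
        key = prefix + "weight"
        if key in state_dict and not hasattr(state_dict[key], "get_original_weight"):
            state_dict[key] = NF4Tensor.from_tensor(state_dict[key])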

Member Author (rohan-varma):

For example, here's what a load_state_dict for QLoRA might look like:

def load_checkpoint_nf4(model):
    # NOTE: this also enforces that ALL linear layers will always be quantized with QLoRA.
    # That might not always be the case if users want to customize and, for example, only
    # quantize some layers.
    # Convert all NF4 weights back to their original (dequantized) weights
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.weight = nn.Parameter(module.weight.get_original_weight())
            # would have to add support for bias as well

    load_checkpoint(model)
    # Now re-quantize
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.weight = nn.Parameter(
                NF4Tensor.from_tensor(module.weight.data).to(module.weight.device),
                requires_grad=False,
            )

This is of course assuming we quantize before loading the state_dict; if we quantize after, that could bring down the complexity.

Contributor:

At least for my mental model it does make sense to inherit from nn.Linear here (as opposed to swapping out self.weight from an nn.Linear, though that is nice and simple). But FrozenNF4Linear is basically a constrained version of nn.Linear, right? The weight is a particular tensor subclass, and we also require that there be no gradient. So imo we should inherit as a way of being explicit about these constraints. That way I can look at an FrozenNF4Linear and know that these conditions should hold, as opposed to having to try and figure them out across all my nn.Linears
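A minimal sketch of that constrained subclass (this paraphrases the pieces of FrozenNF4Linear quoted elsewhere in this review rather than reproducing the diff line for line):

import torch
from torch import nn, Tensor
from torchao.dtypes.nf4tensor import NF4Tensor, linear_nf4

class FrozenNF4Linear(nn.Linear):
    def __init__(self, in_dim: int, out_dim: int, device=None, dtype=None, **kwargs):
        super().__init__(in_dim, out_dim, device=device, dtype=dtype, **kwargs)
        if self.weight.dtype != torch.bfloat16:
            raise RuntimeError("FrozenNF4Linear is only supported with bf16 parameters currently")
        # Quantize the freshly initialized weight and freeze it.
        nf4_weight = NF4Tensor.from_tensor(self.weight.data)
        del self.weight
        self.weight = nn.Parameter(nf4_weight, requires_grad=False)

    def forward(self, input: Tensor) -> Tensor:
        return linear_nf4(input=input, weight=self.weight)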

def test_frozen_nf4_linear(self):
    nf4_linear = FrozenNF4Linear(512, 512, device='cpu', dtype=torch.bfloat16)
    self.assertTrue(isinstance(nf4_linear.weight, NF4Tensor))
    self.assertEqual(torch.bfloat16, nf4_linear.weight.get_original_weight().dtype)
Contributor:

If you keep both, won't that affect memory consumption? If the user wants both, they can decide to keep both around. Otherwise they could convert and re-assign an nn.Linear.weight Tensor like here

def apply_sparse(model):
    apply_fake_sparsity(model)
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))

Contributor:

Oh, I see. Maybe a more standard API would be to support to(torch.bfloat16) or such?

Member Author (rohan-varma):

"If you keep both, won't that affect memory consumption?"

So IIUC we aren't keeping both here, but we can verify by looking at the memory allocation after creating an instance of FrozenNF4Linear.

get_original_weight actually runs the dequantization and restores the original weight - maybe a bit of a misnomer, since the name sorta implies it's stored somewhere and is just accessed.

Contributor:

yeah maybe "build_original_weight"

Contributor:

But can you actually restore the original weight? Hasn't some fidelity been lost after converting to nf4? Hence my suggestion to just overwrite to_dtype here.

Member Author (rohan-varma):

That makes sense; it does seem we lose fidelity and can't get back the exact original weight as one might intuitively expect:

(screenshot omitted)

(@drisspg - just checking my understanding is correct).

So should we update NF4Tensor to get rid of get_original_weight and just have a .to() API? @drisspg, can this be done in a separate PR?
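A tiny sketch of the interim shape (nf4_to_dtype is a hypothetical helper, not an existing API):

import torch
from torchao.dtypes.nf4tensor import NF4Tensor

def nf4_to_dtype(weight: NF4Tensor, dtype: torch.dtype) -> torch.Tensor:
    # get_original_weight() dequantizes; values carry nf4 rounding error,
    # so this is "restore to dtype", not a bit-exact inverse.
    return weight.get_original_weight().to(dtype)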

Contributor:

That is Christian's suggestion, not mine lol. But regardless, I think this PR markedly increases the testing coverage, and like you said, if we want to do the full switch over to subclasses and do everything through torch dispatch, I think it would make sense to do that in a follow-up PR. cc @cpuhrsch

# types.

def forward(self, input: Tensor) -> Tensor:
    return linear_nf4(input=input, weight=self.weight)
Contributor:

I'm surprised you can't just put this into the usual F.linear.

Maybe it's worth updating the ops table https://github.com/pytorch-labs/ao/blob/687f0f0eae8594f90afc447e0b5b52b524cb3fa6/torchao/dtypes/nf4tensor.py#L417-L439

Contributor:

When I first wrote this, I didn't make this a subclass because it didn't support compile. I think I left a comment somewhere that we should likely do this, and just make sure that we indeed get the right thing saved for backwards.

Member Author (rohan-varma):

If I understand the 2 options correctly, it's:

  1. Use torch_dispatch mechanism to "correctly" implement F.linear in this case, where "correctly" means saving the right tensors and avoiding saving extra tensors for the backward pass.
  2. Stick with the current autograd function implementation.

The reason I'm a bit of a proponent of sticking w/the autograd function implementation is because it's a bit more battle tested by @drisspg and the torch_dispatch support is a relatively new introduction. Could also switch over to this in the future.

WDYT @drisspg @cpuhrsch ?
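For context, the autograd-function route (option 2) looks roughly like this (a sketch approximating the shape of linear_nf4, not a verbatim copy; what exactly gets saved for backward is precisely the open detail):

import torch
import torch.nn.functional as F
from torchao.dtypes.nf4tensor import NF4Tensor

class LinearNF4(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight: NF4Tensor):
        # Dequantize for the matmul; hold on to the compact NF4 weight for backward
        # instead of saving a full-precision copy.
        ctx.nf4_weight = weight
        return F.linear(input, weight.get_original_weight().to(input.dtype))

    @staticmethod
    def backward(ctx, grad_output):
        # The weight is frozen, so only the input needs a gradient.
        weight = ctx.nf4_weight
        return grad_output @ weight.get_original_weight().to(grad_output.dtype), None

def linear_nf4(input, weight):
    return LinearNF4.apply(input, weight)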

Contributor:

drisspg/transformer_nuggets#24

It's Friday, so there might be an "easy" way to fix this, but I will leave these as a future-me thing.

del self.weight
self.weight = torch.nn.Parameter(self.nf4_weight, requires_grad=False)

# TODO: likely need to handle state_dict save & load via hooks to properly manage
Contributor:

NF4Tensor might already support that as a Tensor subclass

Member Author (rohan-varma):

awesome! Will probably test this out as part of follow-up work. The main thing I wanna figure out is: if we call load_state_dict w/ base model parameters in bf16 and try to load into NF4Tensor, do we crash, raise a type mismatch issue, or just-in-time quantize the incoming weight and update the data? Will probably learn more about this when I begin the state_dict experimentation.
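A quick probe I have in mind (hypothetical sketch, not part of this PR; assumes FrozenNF4Linear is importable from the module this PR adds):

import torch

def probe_bf16_load():
    linear = FrozenNF4Linear(512, 512, device='cpu', dtype=torch.bfloat16)
    sd = {"weight": torch.randn(512, 512, dtype=torch.bfloat16)}
    try:
        # strict=False because we only care about the weight entry here
        linear.load_state_dict(sd, strict=False)
        print("loaded:", type(linear.weight), linear.weight.dtype)
    except Exception as e:
        print("load failed with:", type(e).__name__, e)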

if self.weight.dtype != torch.bfloat16:
    raise RuntimeError("FrozenNF4Linear is only supported with bf16 parameter currently")

self.nf4_weight = NF4Tensor.from_tensor(self.weight.data).to(device).to(dtype)
cpuhrsch (Contributor) commented Mar 1, 2024:

I wonder if there could also be use for something like a to_nf4 factory function. Then it'd follow the pattern of torch.Tensor.to(<torch dtype>), but as a standalone function (which it'll have to be unless we somehow open up dtypes for open registration).

It's then quite similar to other memory/dtype/device oriented functions. In a nutshell, just because we now use nf4 instead of bfloat16, the Tensor's behavior etc. hasn't changed (of course individual values might have changed, since nf4 has a different range etc.).
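i.e., something like this (to_nf4 is the proposed name; the factory below is a sketch, not an existing function in the repo):

import torch
from torchao.dtypes.nf4tensor import NF4Tensor

def to_nf4(tensor: torch.Tensor) -> NF4Tensor:
    # standalone analogue of tensor.to(<dtype>): quantize to nf4,
    # with the understanding that individual values may shift within nf4's range
    return NF4Tensor.from_tensor(tensor)

# usage: nf4_weight = to_nf4(linear.weight.data)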

Member Author (rohan-varma):

This makes sense. Is this something we'd like to build in this PR or more as a longer-term follow up item? cc @drisspg

Contributor:

TBH I don't know if I follow this

bnb_nf4_linear = self._build_bnb_linear(input_weight=orig_weight)

inp = torch.randn(2, 512, dtype=torch.bfloat16, device='cuda')
self.assertEqual(nf4_linear(inp).sum(), bnb_nf4_linear(inp).sum())

Member Author (rohan-varma):

Will def add reconstruction accuracy test. Curious why it's not as valuable to test exact parity w/BNB though?
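Something along these lines, perhaps (a sketch; the tolerance is a placeholder, not a measured bound):

def test_reconstruction_error(self):
    orig = torch.randn(512, 512, dtype=torch.bfloat16, device='cuda')
    nf4 = NF4Tensor.from_tensor(orig)
    # quantize -> dequantize round trip should stay close to the original, not exact
    err = (nf4.get_original_weight() - orig).abs().mean()
    self.assertLess(err.item(), 1e-1)  # placeholder bound, would need tuning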

@rohan-varma rohan-varma requested review from cpuhrsch and drisspg March 1, 2024 22:25
@rohan-varma rohan-varma force-pushed the nf4_linear branch 2 times, most recently from 1cbabc2 to 87740d0 Compare March 1, 2024 23:03
Comment on lines 23 to 24
if self.weight.dtype != torch.bfloat16:
    raise RuntimeError("FrozenNF4Linear is only supported with bf16 parameter currently")
Contributor:

nit: can't you just check self.dtype first before even initializing the parent class?
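i.e., roughly (the constructor signature here is an assumption about FrozenNF4Linear):

def __init__(self, in_dim, out_dim, device=None, dtype=None, **kwargs):
    # validate before paying for the parent class's weight initialization
    if dtype is not None and dtype != torch.bfloat16:
        raise RuntimeError("FrozenNF4Linear is only supported with bf16 parameters currently")
    super().__init__(in_dim, out_dim, device=device, dtype=dtype, **kwargs)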

@rohan-varma rohan-varma changed the title Add Nf4Linear and tests Add noop detach for Nf4 tensor and enhance nf4 testing Mar 5, 2024
@rohan-varma rohan-varma requested a review from ebsmothers March 5, 2024 19:37
@rohan-varma rohan-varma merged commit c9b397d into main Mar 5, 2024
2 checks passed
drisspg (Contributor) left a comment:

Looks good!

dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
Add noop detach for Nf4 tensor and enhance nf4 testing