Fix for Pascal NaN redux #408

Ph0rk0z · 2023-05-17T13:14:33Z

Force push over-rode but it isn't fixed.

I have tried this and it works. Please test it on your other cards.

All credit goes to: richardwth

fixes: #165

guyman624 · 2023-05-17T17:21:14Z

Doesn't seem to work on my Tesla P40 card, unless oogabooga webui has some other underlying issue as well?
How can I confirm whether the issue is the webui or bnb?
I am getting RuntimeError: expected scalar type Half but found Float while trying to use 8-bit mode.

I built your patch-1 repository by running
CUDA_HOME=~/local/cuda-11.8 CUDA_VERSION=118 make cuda11x
then
sudo python3 setup.py install

guyman624 · 2023-05-17T19:10:17Z

Never mind, it's a bnb issue. I found the 8bit_test.py from the issue you linked and get the same error: RuntimeError: probability tensor contains either inf, nan or element < 0
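For anyone trying to narrow this down: the error text lists three conditions (inf, nan, or a negative element), so a quick check on the values going into sampling tells you whether the numbers are already bad before the sampler sees them. Below is a minimal sketch on a plain list of floats. The helper name is hypothetical and is not part of bnb or the webui.

```python
import math


def is_valid_probability_vector(probs):
    """Return True if every entry is finite and non-negative.

    Hypothetical helper: it mirrors the three conditions named in the
    error message (inf, nan, element < 0) for a plain list of floats.
    """
    return all(math.isfinite(p) and p >= 0 for p in probs)


# A well-formed distribution passes the check.
print(is_valid_probability_vector([0.7, 0.2, 0.1]))

# A single NaN, as produced by a broken matmul, fails it.
print(is_valid_probability_vector([0.7, float("nan"), 0.3]))
```

If the check already fails on the model output with bnb 8-bit enabled and passes without it, that points at bnb rather than the webui.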

Ph0rk0z · 2023-05-18T11:53:28Z

The half vs float is something else. I tested this when doing inference from said webui. You built for all arch, I don't think it's default.

0cc4m's script tests HW matmul first and that SHOULD fail.. I will give it a try and see what happens.

This script, you mean? https://gist.github.com/0cc4m/a753b6a16a618cfbe747a74920dc50f6

Reading it, it also patches BnB, and it is written for a much older version.

Ph0rk0z · 2023-05-18T12:52:37Z

I did some testing.

Load the model:

Inference:

Output generated in 17.68 seconds (2.60 tokens/s, 46 tokens, context 71, seed 572183632)
Output generated in 4.18 seconds (2.16 tokens/s, 9 tokens, context 68, seed 1482993057)

Training:

INFO:Loading raw text file dataset...
INFO:Getting model ready...
INFO:Prepping for training...
INFO:Creating LoRA model...
INFO:Starting training...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.2
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
{'train_runtime': 734.5184, 'train_samples_per_second': 8.659, 'train_steps_per_second': 0.065, 'train_loss': 2.2761232058207193, 'epoch': 0.18}
INFO:LoRA training run is completed and saved.
INFO:Training interrupted.

It's slightly faster using the adamw_bnb_8bit optimizer. On this gen of card it will never be super great due to the lack of HW matmul... but hey, us and the $3k V100 people are in the same boat. 💯

github-actions · 2023-12-20T15:17:13Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

TimDettmers · 2024-01-01T17:28:19Z

One change is good, the other would be a degradation in speed. Let's discuss how to fix this while maintaining speed for other GPUs.

Ph0rk0z · 2024-01-01T19:15:48Z

Haven't really compared this recently with the new codebase. In one of the issues a commenter said that only the first change was required.

Titus-von-Koeller · 2024-02-29T14:30:04Z

@Ph0rk0z I would love to make this PR actionable, but I'm still struggling to understand what Tim means.

According to him one of the changes is good to go and the other "that adds lines which do 16 bit computation by casting the entire matrix to 16 bit that is more inefficient in many cases" needs improvement. Do you know what exactly he means and is this something we could wrap up together?

Maybe we can already merge the change that's good to go and handle the other one separately?

"Force push over-rode but it isn't fixed."
What do you mean by that? Is the commit in the PR the only thing to consider or is there something missing?

matthewdouglas · 2024-02-29T20:20:36Z

I think I understand the change in forward() but I'm struggling to understand what I see in backward().

The change in forward is at least limited in surface area to GPUs from Volta and older, so I suspect this is the part that is "good."
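To make the "limited in surface area" point concrete, here is a minimal sketch of gating a fallback path on compute capability. It assumes the cutoff sits just above Volta, so anything below (7, 5) takes the fallback; both the threshold constant and the function name are illustrative and are not taken from the PR diff. In a real run the tuple would come from `torch.cuda.get_device_capability()`.

```python
# Assumption: cards below compute capability (7, 5) take the fallback path.
# This matches "Volta and older" in the thread but is not copied from the PR.
INT8_MATMUL_MIN_CAPABILITY = (7, 5)


def needs_fallback(capability):
    """Return True if a GPU with this (major, minor) capability is affected.

    Hypothetical helper: Python compares tuples element by element, so
    (7, 0) < (7, 5) is True and (7, 5) < (7, 5) is False.
    """
    return tuple(capability) < INT8_MATMUL_MIN_CAPABILITY


# Compute capabilities of a few cards relevant to this thread.
CARDS = {
    "Tesla P100": (6, 0),
    "Tesla P40": (6, 1),
    "Tesla V100": (7, 0),
    "Tesla T4": (7, 5),
}

for name, capability in CARDS.items():
    print(name, needs_fallback(capability))
```

With a gate like this, newer GPUs never enter the changed branch, which is why the forward change should not cost them any speed.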

Has anyone run the unit tests with these changes?

Ph0rk0z · 2024-03-01T23:04:37Z

Can be tried with just the fwd change to see if it still NaNs. I think people were saying it worked. I basically moved to GPTQ/GGUF and this languished a while so haven't been paying attention and re-testing. My bad. Sat so long I didn't think it would be accepted.

Ph0rk0z · 2024-03-04T11:30:27Z

I removed the backward pass change so people can try it. I haven't had time to test on my machine yet; I'm down to my P6000 and P100 here.

Fix for Pascal NaN redux

89a531a

github-actions bot closed this Dec 30, 2023

TimDettmers reopened this Jan 1, 2024

TimDettmers added the "medium priority" (will be worked on after all high priority issues) and "Low Risk" (risk of bugs in transformers and other libraries) labels Jan 1, 2024

Titus-von-Koeller self-assigned this Feb 29, 2024

Titus-von-Koeller added the waiting for info label Feb 29, 2024

Remove backwards pass.

7f291f7

Titus-von-Koeller force-pushed the main branch 2 times, most recently from 9b72679 to 7800734, July 27, 2024 13:16
