Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Pascal NaN redux #408

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Ph0rk0z
Copy link

@Ph0rk0z Ph0rk0z commented May 17, 2023

Force push over-rode but it isn't fixed.

I have tried this and it works. Please test it on your other cards.

All credit goes to: richardwth

@richardwth

fixes: #165

Force push over-rode but it isn't fixed.
@guyman624
Copy link

Doesn't seem to work on my Tesla P40 card, unless oogabooga webui has some other underlying issue as well?
How can I confirm weather the issue is the webui or bnb?
I am getting RuntimeError: expected scalar type Half but found Float while trying to use 8-bit mode.

I built your patch-1 repository by running
CUDA_HOME=~/local/cuda-11.8 CUDA_VERSION=118 make cuda11x
then
sudo python3 setup.py install

@guyman624
Copy link

Nevermind, its a bnb issue. I found the 8bit_test.py from that issue you linked and get the same RuntimeError: probability tensor contains either inf, nan or element < 0

@Ph0rk0z
Copy link
Author

Ph0rk0z commented May 18, 2023

The half vs float is something else. I tested this when doing inference from said webui. You built for all arch, I don't think it's default.

0cc4m's script tests HW matmul first and that SHOULD fail.. I will give it a try and see what happens.

This script, you mean? https://gist.github.com/0cc4m/a753b6a16a618cfbe747a74920dc50f6

Reading it, it also patches BnB.. that and is for a much previous version.

@Ph0rk0z
Copy link
Author

Ph0rk0z commented May 18, 2023

I did some testing.

Load the model:

ModelLoad

Inference:

Output generated in 17.68 seconds (2.60 tokens/s, 46 tokens, context 71, seed 572183632)
Output generated in 4.18 seconds (2.16 tokens/s, 9 tokens, context 68, seed 1482993057)

inference

Training:

INFO:Loading raw text file dataset...
INFO:Getting model ready...
INFO:Prepping for training...
INFO:Creating LoRA model...
INFO:Starting training...
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.14.2
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
{'train_runtime': 734.5184, 'train_samples_per_second': 8.659, 'train_steps_per_second': 0.065, 'train_loss': 2.2761232058207193, 'epoch': 0.18}
INFO:LoRA training run is completed and saved.
INFO:Training interrupted.

training

It's slightly faster using the adamw_bnb_8bit optimizer. On this gen of card it will never be super great due to the lack of HW matmul... but hey, us and the $3k V100 people are in the same boat. 💯

Copy link

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this Dec 30, 2023
@TimDettmers
Copy link
Collaborator

One change is good, the other would be a degradation in speed. Lets discuss how to fix this while maintaining speed for other GPUs.

@TimDettmers TimDettmers reopened this Jan 1, 2024
@TimDettmers TimDettmers added medium priority (will be worked on after all high priority issues) Low Risk Risk of bugs in transformers and other libraries labels Jan 1, 2024
@Ph0rk0z
Copy link
Author

Ph0rk0z commented Jan 1, 2024

Haven't really compared this recently with the new codebase. In one of the issues a commenter said that only the first change was required.

@Titus-von-Koeller
Copy link
Collaborator

@Ph0rk0z I would love to make this PR actionable somehow but somehow I'm still struggling to understand what Tim means.

According to him one of the changes is good to go and the other "that adds lines which do 16 bit computation by casting the entire matrix to 16 bit that is more inefficient in many cases" needs improvement. Do you know what exactly he means and is this something we could wrap up together?

Maybe we can already merge the change that's good to go and handle the other one separately?

Force push over-rode but it isn't fixed.
What do you mean by that? Is the commit in the PR the only thing to consider or is there something missing?

@matthewdouglas
Copy link
Member

I think I understand the change in forward() but I'm struggling to understand what I see in backward().

The change in forward is at least limited in surface area to GPUs from Volta and older, so I suspect this is the part that is "good."

Has anyone run the unit tests with these changes?

@Ph0rk0z
Copy link
Author

Ph0rk0z commented Mar 1, 2024

Can be tried with just the fwd change to see if it still NaNs. I think people were saying it worked. I basically moved to GPTQ/GGUF and this languished a while so haven't been paying attention and re-testing. My bad. Sat so long I didn't think it would be accepted.

@Ph0rk0z
Copy link
Author

Ph0rk0z commented Mar 4, 2024

I removed backwards pass so people can try it. Haven't had time to test on my machine yet, I'm down to my P6000 and P100 here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Low Risk Risk of bugs in transformers and other libraries medium priority (will be worked on after all high priority issues) waiting for info
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NaN error when using a GPU with no support for igemmlt
5 participants