Llama 13b Model in full precision loading into system RAM, taking 30 minutes to fully load & runtime error #327
Comments
I have no idea what could be causing this. Some people have had better luck using WSL than Windows itself.
Just chiming in that I'm experiencing the same issue with 13b. It stalls for ~15 minutes after the bitsandbytes CUDA SETUP phase, with no significant disk utilization and only one CPU core running at full load, before finally moving on to shard loading. I get the same error when requesting a response: the 'inf', 'nan' or element < 0 error. Win 11.
I am experiencing the same issue.
I'm experiencing the same issue on Windows 10 with an RTX 3060, but with the LLaMA 7b model. The installer is located on an SSD, yet it takes 18 minutes to get past the CUDA setup, and I also get the inf, nan error when trying to generate. I followed the instructions here about the bitsandbytes guide: If I uncheck the
Update: I re-downloaded the weights using HFv2 and now it loads fast, as expected. It seems the weights were the problem.
I had the exact same problem. My weights were the original ones, converted using convert_llama_weights_to_hf.py. The weights from Decapoda Research are different, and switching to them fixed it. (Thanks!)
Can also confirm that I re-downloaded the weights (from the same torrent) and now it works fine. Weird.
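Since several reports above point to the downloaded weights themselves as the culprit, one way to compare a suspect download against a known-good copy is to checksum every file in the model folder. This is a minimal sketch, not something from this project: the helper names `sha256_of_file` and `checksum_model_dir` are made up for illustration, and the folder you pass in (for example `models/llama-13b`) is an assumption about where your weights live.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_model_dir(model_dir):
    """Map each regular file name in model_dir to its SHA-256 digest."""
    return {
        p.name: sha256_of_file(p)
        for p in sorted(Path(model_dir).iterdir())
        if p.is_file()
    }


if __name__ == "__main__":
    # Hypothetical usage: python checksum_weights.py models/llama-13b
    if len(sys.argv) > 1:
        for name, digest in checksum_model_dir(sys.argv[1]).items():
            print(digest, name)
```

Running it on two copies of the same weights and diffing the output shows which shard, if any, differs. It only detects a mismatch between copies; it cannot tell you which copy is the correct one.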
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
Describe the bug
When I try to load llama 13b-hf in full precision using the following command, it takes approximately 30 minutes to load into system RAM, then about 70s to load into VRAM. The model is located on an NVMe drive, and other models like OPT load fine and immediately. I am splitting between 2 GPUs, and this was working just fine not too long ago.
python server.py --listen --model llama-13b --gpu-memory 21 13
Also, I should note, forcing the --bf16 flag does not help.
Is there an existing issue for this?
Reproduction
Run the previous command on a new venv with the updated transformers branch (latest requirements.txt/project pulled).
Screenshot
See below.
Logs
As you can see, it loads extremely slowly into system memory, then completely empties it and loads into VRAM. Nothing is logged after "Loading llama-13b..."; it just hangs until the model is loaded into system memory.
I did another test with the --bf16 flag and it loaded much faster, but still slowly. (I did raise the max memory too, but I'm not sure that really did anything, as it's still far below the max for my GPUs.)
3090 - 13.1GB
A4000 - 12.3GB
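As a rough sanity check on those numbers, the size of the weights alone can be estimated from the parameter count and the bytes per parameter. This is a back-of-the-envelope sketch under stated assumptions: the model is taken to have roughly 13 billion parameters, and activations, CUDA context and other overhead are ignored. `weight_size_gib` is a hypothetical helper, not part of this project.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_size_gib(n_params, bytes_per_param):
    """Approximate size of the model weights in GiB (weights only, no overhead)."""
    return n_params * bytes_per_param / GIB


# Assumption: roughly 13 billion parameters for the 13b model.
N_PARAMS = 13e9

for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{label}: {weight_size_gib(N_PARAMS, nbytes):.1f} GiB")
```

Under those assumptions, 16-bit weights come to about 24 GiB. If the two figures above are per-GPU memory use, their total (13.1GB + 12.3GB) is in the same range, which would suggest the weights end up on the GPUs in 16-bit form.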
Additional Logs
Running the model after loading produces this
System Info
Note: I made some edits to this issue as I tried to add some more detail/methods.