Llama 13b Model in full precision loading into system RAM, taking 30 minutes to fully load & runtime error #327

Closed
1 task done
official-elinas opened this issue Mar 15, 2023 · 8 comments
Labels
bug Something isn't working stale

Comments

@official-elinas

official-elinas commented Mar 15, 2023

Describe the bug

When I try to load llama 13b-hf in full precision using the following command, it takes approximately 30 minutes to load into system RAM and then about 70 seconds to load into VRAM. The model is located on an NVMe drive, and other models such as OPT load immediately without problems. I am splitting the model between 2 GPUs, and this setup was working fine not long ago.

python server.py --listen --model llama-13b --gpu-memory 21 13

Also, I should note, forcing the --bf16 flag does not help.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

Run the previous command in a new venv with the updated transformers branch (latest requirements.txt and project pulled).

Screenshot

See below.

Logs

$ python server.py --listen --model llama-13b --gpu-memory 21 13
Loading llama-13b...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 41/41 [01:11<00:00,  1.74s/it]
Loaded the model in 1880.04 seconds.

As you can see, it loads extremely slowly into system memory, then completely empties it and loads into VRAM. Nothing is logged after "Loading llama-13b..."; the process just hangs until the model has been loaded into system memory.

I did another test with the --bf16 flag and it loaded much faster, but still slowly. (I also raised the max memory, but I'm not sure that did anything, as usage is still far below the maximum for my GPUs.)

3090 - 13.1GB
A4000 - 12.3GB

$ python server.py --listen --model llama-13b --bf16 --gpu-memory 23 15
Loading llama-13b...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 41/41 [01:12<00:00,  1.76s/it]
Loaded the model in 750.97 seconds.
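
The two logs above can be used to estimate how long the process sat idle before shard loading began. The following is a minimal sketch, assuming the "Loaded the model in ..." figure covers the whole load and that the bracketed `[MM:SS<...]` value is the elapsed time of the shard progress bar; `stall_seconds` is a hypothetical helper name, not part of the web UI.

```python
import re


def stall_seconds(log: str) -> float:
    """Return total load time minus the progress bar's elapsed time.

    Assumes the total includes the shard-loading phase shown in the bar.
    """
    total = float(re.search(r"Loaded the model in ([\d.]+) seconds", log).group(1))
    minutes, seconds = re.search(r"\[(\d+):(\d+)<", log).groups()
    return total - (int(minutes) * 60 + int(seconds))


# Relevant lines copied from the two runs reported above.
full_precision_log = (
    "Loading checkpoint shards: 100%|...| 41/41 [01:11<00:00,  1.74s/it]\n"
    "Loaded the model in 1880.04 seconds.\n"
)
bf16_log = (
    "Loading checkpoint shards: 100%|...| 41/41 [01:12<00:00,  1.76s/it]\n"
    "Loaded the model in 750.97 seconds.\n"
)

for name, log in [("full precision", full_precision_log), ("bf16", bf16_log)]:
    print(f"{name}: about {stall_seconds(log):.2f} s outside shard loading")
```

Under those assumptions, roughly 1809 of the 1880 seconds in the full-precision run, and roughly 679 of the 751 seconds in the bf16 run, were spent outside the shard-loading step itself.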

Additional Logs

Running the model after loading produces this:

Traceback (most recent call last):
  File "G:\llm-webui\installer_files\env\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "G:\llm-webui\installer_files\env\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "G:\llm-webui\text-generation-webui\modules\callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "G:\llm-webui\text-generation-webui\modules\text_generation.py", line 191, in generate_with_callback
    shared.model.generate(**kwargs)
  File "G:\llm-webui\installer_files\env\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "G:\llm-webui\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1452, in generate
    return self.sample(
  File "G:\llm-webui\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2504, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
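
The error message says the sampler was handed probabilities that are not valid. A plausible route there, offered as an assumption rather than a diagnosis, is that the model emitted at least one non-finite logit, which then poisons the whole softmax output. The sketch below is a pure-Python analogue of that situation; `softmax` and `validate_probs` are illustrative helpers and are not the PyTorch implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def validate_probs(probs):
    """Reject the same kinds of values the error message lists."""
    for p in probs:
        if math.isinf(p) or math.isnan(p) or p < 0:
            raise RuntimeError(
                "probability tensor contains either `inf`, `nan` or element < 0"
            )


healthy = softmax([2.0, 1.0, 0.5])
validate_probs(healthy)  # passes silently

# A single NaN logit makes the normalising sum NaN, so every probability is NaN.
poisoned = softmax([2.0, float("nan"), 0.5])
try:
    validate_probs(poisoned)
except RuntimeError as err:
    print("sampling rejected:", err)
```

This only illustrates why one bad value is enough to stop sampling; it does not show where in the model the bad value originated.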

System Info

Windows 10
i7-5960X
64 GB RAM
RTX 3090 24GB 
RTX A4000 16GB
Samsung 980 NVMe (models on this drive)

Note: I made some edits to this issue as I tried to add some more detail/methods.

@official-elinas official-elinas added the bug Something isn't working label Mar 15, 2023
@official-elinas official-elinas changed the title from "Llama 13b Model in full precision loading into system RAM and taking 30 minutes to fully load" to "Llama 13b Model in full precision loading into system RAM and taking 30 minutes to fully load & runtime error" Mar 15, 2023
@official-elinas official-elinas changed the title from "Llama 13b Model in full precision loading into system RAM and taking 30 minutes to fully load & runtime error" to "Llama 13b Model in full precision loading into system RAM, taking 30 minutes to fully load & runtime error" Mar 15, 2023
@oobabooga
Owner

I have no idea what could be causing this. Some people have had better luck using WSL than Windows itself.

@JousterL

JousterL commented Mar 15, 2023

Just chiming in that I'm experiencing the same issue with 13b. It stalls for ~15 minutes after the bitsandbytes CUDA SETUP phase, with no significant disk utilization and only one CPU core running at full load, before finally moving on to the shard loading. I get the same error when requesting a response: the `inf`, `nan` or element < 0 one.

Win 11
Ryzen 9 7950X
64 GB RAM
RTX 4090 24GB
Model is on a 2TB WD Black spinning rust.

@Simon1V

Simon1V commented Mar 15, 2023

I am experiencing the same issue
Windows 10
i7 9700k
RTX A6000, GTX 1660Super
32GB RAM
Evo 970 NVME drive

@YukiSakuma

I'm experiencing the same issue on Windows 10 with an RTX 3060, but with the LLaMA 7b model. The installer is located on an SSD, yet it takes 18 minutes to get past the CUDA setup:

Starting the web UI...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Loading binary A:\oobabooga-windows\oobabooga\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
Loading LLaMA-7B...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 33/33 [00:10<00:00,  3.01it/s]
Loaded the model in 1113.38 seconds.
Loading the extension "gallery"... Ok.
A:\oobabooga-windows\oobabooga\installer_files\env\lib\site-packages\gradio\deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

I also get the inf/nan error when trying to generate:
RuntimeError: probability tensor contains either inf, nan or element < 0

I followed the instructions here about the bitsandbytes guide:
#147 (comment)

If I uncheck do_sample, the error disappears, but the generated response I get is ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇

@YukiSakuma

Update: I redownloaded the weights using HFv2 and now the model loads fast, as expected. It seems the weights were the problem.

@demoergo

demoergo commented Mar 20, 2023

Quoting @YukiSakuma: "Update: I redownloaded the weights using HFv2 and now it loads fast as expected, it seems the weights were the problem."

I had the exact same problem. My weights were the original ones, converted using convert_llama_weights_to_hf.py. The weights from Decapoda Research are different, and switching to them fixed it. (Thanks!)

@JousterL

I can also confirm that I re-downloaded the weights (from the same torrent) and now it works fine. Weird.
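
Since several reports here were resolved by re-downloading the weights, a damaged or incomplete download is one possible explanation. One way to test that before downloading again is to compare file hashes against known-good values. This is a minimal sketch: the thread does not provide any checksums, so the `expected` mapping is hypothetical and would have to come from a source you trust, and `sha256_of` and `find_mismatches` are illustrative helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(model_dir, expected):
    """Return the file names that are missing or whose hash differs.

    `expected` maps file name -> SHA-256 hex digest (hypothetical values,
    supplied by the caller from a trusted source).
    """
    bad = []
    for name, want in expected.items():
        path = Path(model_dir) / name
        if not path.is_file() or sha256_of(path) != want.lower():
            bad.append(name)
    return bad
```

Calling `find_mismatches` with the model directory and the trusted digests returns an empty list when every listed file matches, and otherwise names the files worth fetching again.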

@github-actions github-actions bot added the stale label Apr 19, 2023
@github-actions

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.
