
Doesn't run on colab with Pygmalion-6B / results look different on Colab #14

Closed
waifusd opened this issue Jan 20, 2023 · 18 comments
Labels: bug (Something isn't working)

waifusd commented Jan 20, 2023

Using the provided notebook and just changing the model to Pygmalion-6B instead of Pygmalion-1.3B generates the following tcmalloc error and the execution stops cold.

[screenshot: tcmalloc error]

BushyToaster88 commented Jan 20, 2023

Using the provided notebook and just changing the model to Pygmalion-6B instead of Pygmalion-1.3B generates the following tcmalloc error and the execution stops cold.

[screenshot: tcmalloc error]

I have the same issue when trying to run the 4chan model in colab. I get a "Memory cgroup out of memory" error when I look at the dmesg in the colab terminal.

oobabooga added the bug label on Jan 20, 2023
@oobabooga (Owner)

I can confirm this issue.

On Colab, I can't load either pygmalion-2.7b or pygmalion-6b.

The free Colab instance has around 13GB of RAM, while pygmalion-2.7b takes 6.8GB of RAM to load on my system (peak allocation), so in principle it should work.
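
For reference, here is a minimal sketch (not the webui's actual loading path) of how the peak CPU RAM at load time can be reduced with standard transformers options; the exact savings on Colab are an assumption:

import torch
from transformers import AutoModelForCausalLM

# Loading in float16 with low_cpu_mem_usage=True keeps the peak CPU allocation
# close to a single copy of the weights instead of materializing two copies,
# which matters on an instance with only ~13GB of RAM.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)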

waifusd (Author) commented Jan 20, 2023

Pygmalion-6B should be loadable on Colab too, since the Colab notebooks of other projects are able to load it (namely the Pyg devs' own notebook and KoboldAI's).

@oobabooga (Owner)

An anonymous 4chan user has kindly provided this notebook, which allows the 6b model to be loaded in 8-bit mode:

https://colab.research.google.com/github/81300/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb

I haven't tested it yet because Google is not giving me a free instance with a GPU.

waifusd (Author) commented Jan 20, 2023

I tried it:

[screenshot of the result on Colab]

@oobabooga (Owner)

I can confirm that the results look worse on Colab than when running locally. I made a comparison using debug.txt, which is a deterministic preset that should generate the same responses:

do_sample=False,
max_new_tokens=tokens,
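
As a rough illustration (not the webui's actual code), these settings amount to greedy decoding, which for a fixed model and environment should be deterministic; the prompt and token count below are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b")
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
).to("cuda")

prompt = "Hi"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(
    **inputs,
    do_sample=False,     # greedy decoding: no sampling randomness
    max_new_tokens=200,  # stands in for the preset's 'tokens' value
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))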

But the responses are different:

[screenshots: Colab response vs. local response]

I don't know the cause and will leave this issue open to see if someone has an idea.

Some other observations:

  • It is now possible to load the 6b model with !python server.py --cai-chat --share --load-in-8bit or !python server.py --cai-chat --share --auto-devices (after the bitsandbytes version upgrade suggested by anon). The tcmalloc warnings still appear, but the model loads successfully.
  • My instance had a Tesla T4 GPU.

waifusd (Author) commented Jan 21, 2023

I still can't launch the "basic commands" Colab even with the additional arguments.

I tried a little more on the Anon's Colab with a different character, and without fail, within the first 5 messages I can always get the bot to sperg out in a normal chat:

[screenshot of the bot's output]

I wish I could test a local installation, but without a GPU I can only use Colab.

ghost commented Jan 21, 2023

Disabling text output streaming through --no-stream appears to help with the results' quality on Colab.

Another issue is that the Python environment is inconsistent if you don't make sure Conda gets activated at each command. For instance, if --load-in-8bit is used, this can lead to bitsandbytes using an older libcudart.so that comes packaged with the Google container rather than the one installed through Conda.

The CPU RAM issues can be alleviated if we break the model into smaller shards, though I'm not yet sure of the negative side effects. I have updated that notebook, adding an option and a script to do this within the runtime.
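
For the sharding idea, here is a minimal sketch of one way to re-shard a checkpoint with standard transformers calls (an illustration only, not necessarily the script added to that notebook):

import torch
from transformers import AutoModelForCausalLM

# Load once, then save with a smaller maximum shard size so that no single
# checkpoint file is huge when it is later read back in.
# Note: this step itself still needs enough RAM to hold the full model once,
# so in practice it would be run on a machine with more memory.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
)
model.save_pretrained("pygmalion-6b-sharded", max_shard_size="2GB")  # output dir is a placeholder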

oobabooga (Owner) commented Jan 21, 2023

@81300 thanks for looking into this. The most likely culprit is indeed the CUDA library. The debug preset generating different results implies that the logits are different, which can only happen if the internal calculations are performed differently (different precision?).

--no-stream shouldn't change the results at all; it should only make text generation a bit faster than with streaming on. Try uploading a file called debug.txt under text-generation-webui/presets with the settings from my previous response, and you will be able to verify that the results are the same whether you run with or without streaming.

In any case, I don't know why, but the replies in your notebook are a LOT better now (although still different from running locally). I have sent 18 messages to Chiharu with the default settings and she didn't start ranting once.

@oobabooga (Owner)

My only remaining question is whether it is possible to get the exact same responses on Colab and locally. The Colab responses feel a bit worse than the local ones; not nearly as bad as before, but still not as good.

The debug preset is now included by default.

waifusd (Author) commented Jan 24, 2023

Yesterday I tried the Colab again with Pyg-6B; the bot didn't start to rant in the first 5 messages, and the answers were "slightly passable" at first (I've never been able to run it locally, so I have no idea how that's supposed to look; my only point of comparison is CAI, to which Pyg is not even remotely close).
After 20 or so messages the bot started to answer with messages that were 80% identical to the last one, and it culminated in it responding with a 1:1 identical message. I tried regenerating, or generating a message while staying silent (not inputting a message of my own), and the bot would spout the same identical message.

ghost commented Jan 24, 2023

@oobabooga, with the deterministic preset I currently get the same results locally and on Colab.

[screenshots: local response vs. Colab response with the deterministic preset]

However, there could be other factors at play besides the inference settings, as described in [1] and [2]. Locally I could only test with an RTX 2000 series card which, like the Tesla T4s I've been assigned each time on Colab, is on the Turing architecture. Perhaps cuDNN [3] behaves differently on your Ampere card.

Now I feel like forcing the app on Colab, via Conda, to use different CUDA libraries than the ones preinstalled by Google is wrong, because the instance isn't exactly bare metal: it runs in a Docker container [4] and the host has its own CUDA drivers.

[1] https://pytorch.org/docs/stable/notes/randomness.html
[2] https://www.mldawn.com/reproducibility-in-pytorch/
[3] https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html
[4] https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html
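
For reference, here is a minimal sketch of the reproducibility knobs described in [1] and [2]; whether the webui sets any of these is an open question here:

import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG that could influence generation.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable autotuning, which can
    # otherwise pick different algorithms between runs or GPUs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)

# Even with all of this, different GPU architectures can legitimately produce
# slightly different floating-point results.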

@oobabooga (Owner)

It is also possible that the divergent results are caused by device_map='auto' and not by the GPU architecture; see this issue:

huggingface/transformers#20896

oobabooga changed the title from "Doesn't run on colab with Pygmalion-6B" to "Doesn't run on colab with Pygmalion-6B / results look different on Colab" on Jan 24, 2023
@oobabooga (Owner)

Here is the ">Hi" test on my RTX 3090 with the debug preset, for comparison. On my laptop, which has a Turing GPU, the results are the same as on Colab and for @81300.

The goal is to reproduce this >Hi result on any GPU.

[screenshot of the RTX 3090 output]

ghost commented Jan 24, 2023

It is also possible that the divergent results are caused by device_map='auto' and not by the GPU architecture; see this issue:

huggingface/transformers#20896

Thanks.

We still can't load Pygmalion-6B on Colab without --auto-devices, because the model simply doesn't fit into the 12GB of CPU RAM initially; from what I gather, that load into CPU RAM must happen anyway before the weights can be fully sent to the GPU. The fact that it worked with 8-bit quantization is a side effect of 8-bit loading requiring device_map='auto' (or maybe passing a custom map could work, but I haven't made one).
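
For context, a minimal sketch of the two loading paths being discussed, using the transformers/bitsandbytes interface of that time rather than the webui's own code; the custom map below is a made-up example (in practice you would pick one or the other):

from transformers import AutoModelForCausalLM

# 8-bit loading goes through bitsandbytes and needs a device map; the usual
# choice is device_map="auto", which lets accelerate decide the placement.
model_auto = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    load_in_8bit=True,
    device_map="auto",
)

# A hand-written map pins every module to one device explicitly
# (here: the whole model on GPU 0), avoiding automatic splitting.
model_pinned = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    load_in_8bit=True,
    device_map={"": 0},
)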

To get around the memory issue we could try DeepSpeed ZeRO-3 inference, in which case, if you launch a single process on a single GPU, the CPU RAM requirement should just be the size of the biggest shard in the model. I will update later on this.

So I retested on Colab without --auto-devices (and using the above PR changes) with Pygmalion-1.3B and the following scenarios:

  • No extra arguments for loading: ok as a base test
  • --auto-devices: same output as the first
  • --cpu: same output
  • Model converted to .pt: same output
  • --load-in-8bit: different output, but that's probably to be expected

I also tried EleutherAI/gpt-neo-125M on Tesla T4 with the task and Python snippet from the issue you linked but my results were correct.
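
For anyone wanting to repeat that kind of check, here is a generic sketch (not the exact snippet from the linked issue) that compares greedy outputs with and without device_map='auto' on a small model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Plain single-device load.
model_plain = AutoModelForCausalLM.from_pretrained(name).to("cuda")
out_plain = model_plain.generate(**inputs, do_sample=False, max_new_tokens=20)

# Same model placed by accelerate via device_map="auto".
model_auto = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
out_auto = model_auto.generate(**inputs, do_sample=False, max_new_tokens=20)

print(tokenizer.decode(out_plain[0]))
print(tokenizer.decode(out_auto[0]))
print("identical:", torch.equal(out_plain.cpu(), out_auto.cpu()))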

oobabooga (Owner) commented Jan 25, 2023

@81300 @waifusd, I have discovered the issue: it was the model.

The model that I was using on my computer was the very first commit to the HuggingFace repository, which I downloaded on January 12th.

The current commit to that repository (main branch) is different from the first one. This updated commit passes the >Hi test when executed in GPU-only mode, but fails in CPU, GPU+CPU, or 8-bit mode, generating the "what do you think of my setup?" response that we have been seeing.

On the other hand, the first commit passes the >Hi test in any mode and always yields the same responses. This is the response that I got on Colab using this commit:

[screenshot of the Colab response]

I have re-uploaded this first commit here: https://huggingface.co/oobabooga/pygmalion-6b-original. Using it is not necessary and I will delete it; just download the previous commit using python download-model.py PygmalionAI/pygmalion-6b --branch b8344bb4eb76a437797ad3b19420a13922aaabe1

You can try it in this notebook (which is the one I used for the screenshot above): https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb

In other words, it is now possible to get 1000x better responses on Colab.
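
As an aside, the same commit pinning can also be done directly through the revision argument in transformers (a sketch using the hash from the command above):

from transformers import AutoModelForCausalLM, AutoTokenizer

commit = "b8344bb4eb76a437797ad3b19420a13922aaabe1"  # first commit of the repo, per the command above
tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b", revision=commit)
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", revision=commit)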

@oobabooga (Owner)

@81300: you are right that the next step now would be to ditch 8-bit mode altogether, as this would probably make the model run a bit faster.

I am also worried about the loading times on Colab. Ideally, it would be best to get the model working in less than 5 minutes instead of 12.

@oobabooga (Owner)

Here are some comparisons between different branches of pygmalion-6b and different modes (GPU, GPU+CPU, and 8-bit):

https://huggingface.co/PygmalionAI/pygmalion-6b/discussions/8#63d15cae119416cdbe15ae2e
