
Doesn't run on colab with Pygmalion-6B / results look different on Colab #14

Closed
waifusd opened this issue Jan 20, 2023 · 18 comments
Labels: bug (Something isn't working)

waifusd commented Jan 20, 2023

Using the provided notebook and just changing the model to Pygmalion-6B instead of Pygmalion-1.3B generates the following tcmalloc error and the execution stops cold.

[screenshot: tcmalloc error]

BushyToaster88 commented Jan 20, 2023

Using the provided notebook and just changing the model to Pygmalion-6B instead of Pygmalion-1.3B generates the following tcmalloc error and the execution stops cold.

[screenshot: tcmalloc error]

I have the same issue when trying to run the 4chan model in colab. I get a "Memory cgroup out of memory" error when I look at the dmesg in the colab terminal.

oobabooga added the bug label on Jan 20, 2023
@oobabooga (Owner)

I can confirm this issue.

On Colab, I can't load either pygmalion-2.7b or pygmalion-6b.

The free Colab instance has around 13GB of RAM, while pygmalion-2.7b takes 6.8GB of RAM to load on my system (peak allocation), so in principle it should work.
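
For reference, here is a minimal sketch (not the webui's actual loading path) of how the peak CPU RAM at load time can be reduced with standard transformers options; the exact savings on Colab are an assumption:

import torch
from transformers import AutoModelForCausalLM

# Loading in float16 with low_cpu_mem_usage=True keeps the peak CPU allocation
# close to a single copy of the weights instead of materializing two copies,
# which matters on an instance with only ~13GB of RAM.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)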

waifusd (Author) commented Jan 20, 2023

Pygmalion-6B should be loadable on Colab too, since the Colab notebooks of other projects are able to load it (namely the Pyg devs' own notebook and KoboldAI's).

@oobabooga (Owner)

An anonymous 4chan user has kindly provided this notebook, which allows the 6b model to be loaded in 8-bit mode:

https://colab.research.google.com/github/81300/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb

I haven't tested it yet because Google is not giving me a free instance with a GPU.

waifusd (Author) commented Jan 20, 2023

I tried it:

[screenshot of the result on Colab]

@oobabooga (Owner)

I can confirm that the results look worse on Colab than when running locally. I made a comparison using debug.txt, which is a deterministic preset that should generate the same responses:

do_sample=False,
max_new_tokens=tokens,
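
As a rough illustration (not the webui's actual code), these settings amount to greedy decoding, which for a fixed model and environment should be deterministic; the prompt and token count below are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b")
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
).to("cuda")

prompt = "Hi"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output_ids = model.generate(
    **inputs,
    do_sample=False,     # greedy decoding: no sampling randomness
    max_new_tokens=200,  # stands in for the preset's 'tokens' value
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))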

But the responses are different:

[screenshots: Colab response vs. local response]

I don't know the cause and will leave this issue open to see if someone has an idea.

Some other observations:

  • It is now possible to load the 6b model with !python server.py --cai-chat --share --load-in-8bit or !python server.py --cai-chat --share --auto-devices (after the bitsandbytes version upgrade suggested by anon). The tcmalloc warnings still appear, but the model loads successfully.
  • My instance had a Tesla T4 GPU.

waifusd (Author) commented Jan 21, 2023

I still can't launch the "basic commands" Colab even with the additional arguments.

I tried a little more on the Anon's Colab with a different character, and without fail, within the first 5 messages I can always get the bot to sperg out in a normal chat:

[screenshot of the bot's output]

I wish I could test a local installation, but without a GPU I can only use Colab.

ghost commented Jan 21, 2023

Disabling text output streaming through --no-stream appears to help with the results' quality on Colab.

Another issue is that the Python environment is inconsistent if you don't make sure Conda gets activated at each command. For instance, if --load-in-8bit is used, this can lead to bitsandbytes using an older libcudart.so that comes packaged with the Google container rather than the one installed through Conda.

The CPU RAM issues can be alleviated if we break the model into smaller shards, though I'm not yet sure of the negative side effects. I have updated that notebook, adding an option and a script to do this within the runtime.
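
For the sharding idea, here is a minimal sketch of one way to re-shard a checkpoint with standard transformers calls (an illustration only, not necessarily the script added to that notebook):

import torch
from transformers import AutoModelForCausalLM

# Load once, then save with a smaller maximum shard size so that no single
# checkpoint file is huge when it is later read back in.
# Note: this step itself still needs enough RAM to hold the full model once,
# so in practice it would be run on a machine with more memory.
model = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b", torch_dtype=torch.float16
)
model.save_pretrained("pygmalion-6b-sharded", max_shard_size="2GB")  # output dir is a placeholder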

oobabooga (Owner) commented Jan 21, 2023

@81300 thanks for looking into this. The most likely culprit is indeed the CUDA library. The debug preset generating different results implies that the logits are different, which can only happen if the internal calculations are performed differently (different precision?).

--no-stream shouldn't change the results at all; it should only make text generation a bit faster than with streaming on. Try uploading a file called debug.txt under text-generation-webui/presets with the settings from my previous response, and you will be able to verify that the results are the same whether you run with or without streaming.

In any case, I don't know why, but the replies in your notebook are a LOT better now (although still different from running locally). I have sent 18 messages to Chiharu with the default settings and she didn't start ranting once.

@oobabooga (Owner)

My only remaining question is whether it is possible to get the exact same responses on Colab and locally. The Colab responses feel a bit worse than the local ones; not nearly as bad as before, but still not as good.

The debug preset is now included by default.

waifusd (Author) commented Jan 24, 2023

Yesterday I tried the Colab again with Pyg-6B; the bot didn't start to rant in the first 5 messages, and the answers were "slightly passable" at first (I've never been able to run it locally, so I have no idea how that's supposed to look; my only point of comparison is CAI, to which Pyg is not even remotely close).
After 20 or so messages the bot started to answer with messages that were 80% identical to the last one, and it culminated in it responding with a 1:1 identical message. I tried regenerating, or generating a message while staying silent (not inputting a message of my own), and the bot would spout the same identical message.

ghost commented Jan 24, 2023

@oobabooga, with the deterministic preset I currently get the same results locally and on Colab.

[screenshots: local response vs. Colab response with the deterministic preset]

However, there could be other factors at play besides the inference settings, as described in [1] and [2]. Locally I could only test with an RTX 2000 series card which, like the Tesla T4s I've been assigned each time on Colab, is on the Turing architecture. Perhaps cuDNN [3] behaves differently on your Ampere card.

Now I feel like forcing the app on Colab, via Conda, to use different CUDA libraries than the ones preinstalled by Google is wrong, because the instance isn't exactly bare metal: it runs in a Docker container [4] and the host has its own CUDA drivers.

[1] https://pytorch.org/docs/stable/notes/randomness.html
[2] https://www.mldawn.com/reproducibility-in-pytorch/
[3] https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html
[4] https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html
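
For reference, here is a minimal sketch of the reproducibility knobs described in [1] and [2]; whether the webui sets any of these is an open question here:

import random
import numpy as np
import torch

def make_deterministic(seed: int = 0) -> None:
    # Seed every RNG that could influence generation.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels and disable autotuning, which can
    # otherwise pick different algorithms between runs or GPUs.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warn on ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)

# Even with all of this, different GPU architectures can legitimately produce
# slightly different floating-point results.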

@oobabooga (Owner)

It is also possible that the divergent results are caused by device_map='auto' and not by the GPU architecture; see this issue:

huggingface/transformers#20896

oobabooga changed the title from "Doesn't run on colab with Pygmalion-6B" to "Doesn't run on colab with Pygmalion-6B / results look different on Colab" on Jan 24, 2023
@oobabooga (Owner)

Here is the ">Hi" test on my RTX 3090 with the debug preset, for comparison. On my laptop, which has a Turing GPU, the results are the same as on Colab and for @81300.

The goal is to reproduce this >Hi result on any GPU.

[screenshot of the RTX 3090 output]

ghost commented Jan 24, 2023

It is also possible that the divergent results are caused by device_map='auto' and not by the GPU architecture; see this issue:

huggingface/transformers#20896

Thanks.

We still can't load Pygmalion-6B on Colab without --auto-devices, because the model simply doesn't fit into the 12GB of CPU RAM initially; from what I gather, that load into CPU RAM must happen anyway before the weights can be fully sent to the GPU. The fact that it worked with 8-bit quantization is a side effect of 8-bit loading requiring device_map='auto' (or maybe passing a custom map could work, but I haven't made one).
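
For context, a minimal sketch of the two loading paths being discussed, using the transformers/bitsandbytes interface of that time rather than the webui's own code; the custom map below is a made-up example (in practice you would pick one or the other):

from transformers import AutoModelForCausalLM

# 8-bit loading goes through bitsandbytes and needs a device map; the usual
# choice is device_map="auto", which lets accelerate decide the placement.
model_auto = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    load_in_8bit=True,
    device_map="auto",
)

# A hand-written map pins every module to one device explicitly
# (here: the whole model on GPU 0), avoiding automatic splitting.
model_pinned = AutoModelForCausalLM.from_pretrained(
    "PygmalionAI/pygmalion-6b",
    load_in_8bit=True,
    device_map={"": 0},
)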

To get around the memory issue we could try DeepSpeed ZeRO-3 inference, in which case, if you launch a single process on a single GPU, the CPU RAM requirement should just be the size of the biggest shard in the model. I will update later on this.

So I retested on Colab without --auto-devices (and using the above PR changes) with Pygmalion-1.3B and the following scenarios:

  • No extra arguments for loading: ok as a base test
  • --auto-devices: same output as the first
  • --cpu: same output
  • Model converted to .pt: same output
  • --load-in-8bit: different output, but that's probably to be expected

I also tried EleutherAI/gpt-neo-125M on Tesla T4 with the task and Python snippet from the issue you linked but my results were correct.
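
For anyone wanting to repeat that kind of check, here is a generic sketch (not the exact snippet from the linked issue) that compares greedy outputs with and without device_map='auto' on a small model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Plain single-device load.
model_plain = AutoModelForCausalLM.from_pretrained(name).to("cuda")
out_plain = model_plain.generate(**inputs, do_sample=False, max_new_tokens=20)

# Same model placed by accelerate via device_map="auto".
model_auto = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
out_auto = model_auto.generate(**inputs, do_sample=False, max_new_tokens=20)

print(tokenizer.decode(out_plain[0]))
print(tokenizer.decode(out_auto[0]))
print("identical:", torch.equal(out_plain.cpu(), out_auto.cpu()))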

oobabooga (Owner) commented Jan 25, 2023

@81300 @waifusd, I have discovered the issue: it was the model.

The model that I was using on my computer was the very first commit to the HuggingFace repository, which I downloaded on January 12th.

The current commit to that repository (main branch) is different from the first one. This updated commit passes the >Hi test when executed in GPU-only mode, but fails in CPU, GPU+CPU, or 8-bit mode, generating the "what do you think of my setup?" response that we have been seeing.

On the other hand, the first commit passes the >Hi test in any mode and always yields the same responses. This is the response that I got on Colab using this commit:

[screenshot of the Colab response]

I have re-uploaded this first commit here: https://huggingface.co/oobabooga/pygmalion-6b-original. Using it is not necessary and I will delete it; just download the previous commit using python download-model.py PygmalionAI/pygmalion-6b --branch b8344bb4eb76a437797ad3b19420a13922aaabe1

You can try it in this notebook (which is the one I used for the screenshot above): https://colab.research.google.com/github/oobabooga/AI-Notebooks/blob/main/Colab-TextGen-GPU.ipynb

In other words, it is now possible to get 1000x better responses on Colab.
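
As an aside, the same commit pinning can also be done directly through the revision argument in transformers (a sketch using the hash from the command above):

from transformers import AutoModelForCausalLM, AutoTokenizer

commit = "b8344bb4eb76a437797ad3b19420a13922aaabe1"  # first commit of the repo, per the command above
tokenizer = AutoTokenizer.from_pretrained("PygmalionAI/pygmalion-6b", revision=commit)
model = AutoModelForCausalLM.from_pretrained("PygmalionAI/pygmalion-6b", revision=commit)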

@oobabooga (Owner)

@81300: you are right that the next step now would be to ditch 8-bit mode altogether, as this would probably make the model run a bit faster.

I am also worried about the loading times on Colab. Ideally, it would be best to get the model working in less than 5 minutes instead of 12.

@oobabooga (Owner)

Here are some comparisons between different branches of pygmalion-6b and different modes (GPU, GPU+CPU, and 8-bit):

https://huggingface.co/PygmalionAI/pygmalion-6b/discussions/8#63d15cae119416cdbe15ae2e
