Implement ZeRO inference #40
Seems like ZeRO inference could improve the performance of offloading to RAM/NVMe. I don't know if huggingface's accelerate is already using it, but if not, it would be a great feature to add.
This is definitely worth looking into. There is also ONNX, which has a similar promise of improving inference speeds. I have never really understood how to use those engines and how production-ready they are compared to transformers.
Here's a draft with some ideas: #43

There would be 3 new arguments for server.py: --deepspeed, --nvme-offload-dir, and --local_rank (the last one is set automatically by the launcher).

DeepSpeed must be installed:

$ pip install deepspeed

Running should be done with the DeepSpeed launcher:

$ deepspeed --num_gpus=1 server.py --cai-chat --deepspeed

For NVMe offloading:

$ deepspeed --num_gpus=1 server.py --cai-chat --deepspeed --nvme-offload-dir /mnt/offload

YMMV. While I haven't tested multi-GPU setups yet, in my tests the VRAM usage on a single card was greatly optimized.
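For context, a minimal sketch of what ZeRO-3 inference looks like through the Hugging Face integration (my illustration, not the PR's actual code; the model name, NVMe path, and config values are placeholder assumptions):

```python
# Sketch only: ZeRO-3 inference via the HF DeepSpeed integration.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig  # moved to transformers.integrations in newer versions

model_name = "PygmalionAI/pygmalion-6b"  # placeholder model

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Optional: drop this block to keep parameters on the GPU.
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/mnt/offload",  # matches --nvme-offload-dir above
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # required key, unused for inference
}

# Must exist *before* from_pretrained so weights are partitioned on load
# instead of being fully materialized first.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained(model_name)
engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello", return_tensors="pt").to(torch.cuda.current_device())
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=20)[0]))
```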
That's very exciting. I will test and merge it later today. Other than VRAM usage, did you see a noticeable improvement in the text generation speed?
I've seen the opposite, probably due to the partitioning that happens under ZeRO-3. It seems like using this would only make sense if you have large models to load, or if you want to make use of multiple GPUs. The Hugging Face docs do warn about performance and give a few more tuning tips here:
Another thing which is a bit confusing is that the ZeRO-Inference that's integrated into transformers is not the same thing as DeepSpeed-Inference. Here is a good rundown:
And an interesting article:

Now, I've also tried DeepSpeed-Inference briefly, but it has a number of bugs that are being worked on, concerning split Hugging Face checkpoints (like Pygmalion 6B), bad output, and model incompatibility. Worth keeping an eye on, however.
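For reference, the tuning tips mentioned above boil down to a handful of DeepSpeed config keys. A hedged fragment with illustrative values only, not recommendations:

```python
# These knobs trade VRAM headroom against how often ZeRO-3 has to
# re-fetch partitioned parameters during the forward pass.
zero_tuning = {
    "stage": 3,
    "stage3_max_live_parameters": 1e9,          # params materialized at once
    "stage3_max_reuse_distance": 1e9,           # keep params reused soon after
    "stage3_prefetch_bucket_size": 5e8,         # how aggressively to prefetch
    "stage3_param_persistence_threshold": 1e5,  # small params stay GPU-resident
}
```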
I have accepted the PR and have some observations: (...)
Could be something to do with Conda, try:

$ conda install -c conda-forge gcc
I've seen that. Limiting the memory with cgroups can help (MemoryHigh throttles and reclaims; MemoryMax is the hard kill limit):

$ systemd-run --user --scope -p MemoryHigh=15G -p MemoryMax=16G bash
$ conda activate textgen
$ deepspeed --num_gpus=1 server.py --model pygmalion-6b --cai-chat --deepspeed
(.....)
DeepSpeed ZeRO-3 is enabled: True
Loaded the model in 98.31 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Dialogue tokenized to:
| So how did you get into computer engineering?
Installing the latest gcc version with conda worked, but then (...). As for limiting the maximum RAM with systemd-run: that caused deepspeed to become unresponsive and never load the model, even after several minutes.
Strange, could you check what
Admittedly, the test above was run with 8GB of swap available, and loading was much, much slower (nearly 2 minutes). I want to verify whether models that are sharded into smaller chunks really make a difference for the initial RAM requirement.
Is there any actual benefit to using bfloat16 if the card supports it (Ampere & Lovelace)?
@Manimap, the docs claim it's faster. There's also a caveat for fp16:
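As a quick sanity check before enabling it, PyTorch has a built-in helper that reports whether the GPU actually supports bfloat16 (a minimal sketch, nothing webui-specific):

```python
import torch

# True on Ampere (e.g. RTX 30xx, A100) and newer parts such as Lovelace.
print(torch.cuda.is_bf16_supported())
```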
Sharding appears to help. For instance, trying to load the unsharded OPT-13B-Erebus model with 30GB of CPU RAM, 8GB of swap and NVMe offloading led to OOM.

$ ls models/OPT-13B-Erebus
config.json LICENSE.md merges.txt pytorch_model.bin README.md special_tokens_map.json tokenizer_config.json vocab.json

$ systemd-run --user --scope -p MemoryHigh=28G -p MemoryMax=30G bash
$ /usr/bin/time -f %M deepspeed --num_gpus=1 server.py --model OPT-13B-Erebus --notebook --deepspeed --nvme-offload-dir /mnt/offload/
[] [INFO] [launch.py:162:main] dist_world_size=1
[] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading OPT-13B-Erebus...
[] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 21388
[] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'server.py', '--local_rank=0', '--model', 'OPT-13B-Erebus', '--notebook', '--deepspeed', '--nvme-offload-dir', '/mnt/offload/'] exits with return code = -9
Command exited with non-zero status 247
30723720

(That's the peak resident set size in kilobytes, as printed by /usr/bin/time -f %M.)

OPT-13B-Erebus sharded into 1GB chunks, on the other hand, could be loaded, and the peak RAM usage looked better.

$ ls models/OPT-13B-Erebus-sharded
config.json pytorch_model-00005-of-00028.bin pytorch_model-00011-of-00028.bin pytorch_model-00017-of-00028.bin pytorch_model-00023-of-00028.bin pytorch_model.bin.index.json
merges.txt pytorch_model-00006-of-00028.bin pytorch_model-00012-of-00028.bin pytorch_model-00018-of-00028.bin pytorch_model-00024-of-00028.bin special_tokens_map.json
pytorch_model-00001-of-00028.bin pytorch_model-00007-of-00028.bin pytorch_model-00013-of-00028.bin pytorch_model-00019-of-00028.bin pytorch_model-00025-of-00028.bin tokenizer_config.json
pytorch_model-00002-of-00028.bin pytorch_model-00008-of-00028.bin pytorch_model-00014-of-00028.bin pytorch_model-00020-of-00028.bin pytorch_model-00026-of-00028.bin tokenizer.json
pytorch_model-00003-of-00028.bin pytorch_model-00009-of-00028.bin pytorch_model-00015-of-00028.bin pytorch_model-00021-of-00028.bin pytorch_model-00027-of-00028.bin vocab.json
pytorch_model-00004-of-00028.bin pytorch_model-00010-of-00028.bin pytorch_model-00016-of-00028.bin pytorch_model-00022-of-00028.bin pytorch_model-00028-of-00028.bin

$ systemd-run --user --scope -p MemoryHigh=28G -p MemoryMax=30G bash
$ deepspeed --num_gpus=1 server.py --model OPT-13B-Erebus-sharded --notebook --deepspeed --nvme-offload-dir /mnt/offload/
[] [INFO] [launch.py:162:main] dist_world_size=1
[] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Loading OPT-13B-Erebus-sharded...
[] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 13.11B parameters
DeepSpeed ZeRO-3 is enabled: True
Loaded the model in 86.68 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
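For anyone wanting to reproduce the sharded copy, a hedged sketch (paths taken from the test above; note that `from_pretrained` still needs enough RAM to hold the model once):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "models/OPT-13B-Erebus",   # the unsharded checkpoint from above
    torch_dtype="auto",
    low_cpu_mem_usage=True,    # avoid double-allocating weights while loading
)
# Writes pytorch_model-0000x-of-0000y.bin shards plus an index file.
model.save_pretrained("models/OPT-13B-Erebus-sharded", max_shard_size="1GB")
```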
This is useful to know, @81300. Could this be used to instantiate models on Colab without huge RAM usage? If so, it could be possible to initialize a new notebook by only installing the requirements (skipping Conda).
Alright, thanks. So it's faster for those who can run it, and maybe there are some advantages to training models in this "mixed precision" in particular.
I tested Colab today. For larger models, the ZeRO-3 CPU/NVMe offloading makes heavy use of CPU RAM anyway. Google's safety mechanisms seem very sensitive: they will kill your process even if DeepSpeed would not actually run out of memory. You can't use cgroups to throttle properly because the Colab runtime is within an unprivileged container, and for that same reason you cannot create swap. The DeepSpeed config doesn't provide any knobs for the maximum RAM to offload with (they have an open issue).

That said, you can of course disable offloading entirely and successfully instantiate a sharded Pygmalion 6B model onto the GPU with ZeRO-3. This requires very little CPU RAM (just the size of the biggest shard), but in this scenario the Nvidia T4 will run out of VRAM once you begin generating text. Maybe tuning (...)

By the way, I discovered that a presharded Pygmalion 6B consisting of 2GB chunks instantiates just fine on the free Colab w/o DeepSpeed, 8-bit mode (#14 (comment)), auto-devices or Conda. Inference works. However, the sharding must be done on a system with sufficient memory, so I had to rehost the model (not ideal). You can play with a test notebook here. As you suggested, it doesn't install Conda and therefore loads up much quicker!
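Disabling offloading while keeping ZeRO-3, as described above, is just a config change; a hedged fragment:

```python
# ZeRO-3 with parameter offload switched off: shards stream straight
# to the GPU(s). Omitting "offload_param" entirely has the same effect.
zero_no_offload = {
    "stage": 3,
    "offload_param": {"device": "none"},
}
```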
@81300, with your resharded + safetensors rehost, the Colab loading times for (...)

Indeed, using a rehost is not as pretty as lazy-loading the model from disk the way the Kobold client does, but at the same time this is allowed by the (...)
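On why a safetensors rehost loads faster: the format keeps an index in its header, so tensors can be read lazily one at a time instead of unpickling the whole checkpoint. A hedged sketch with a placeholder filename:

```python
from safetensors import safe_open

# Only the requested tensor is materialized; the rest of the file is
# never deserialized, unlike torch.load on a pickle-based .bin file.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    first_key = f.keys()[0]
    tensor = f.get_tensor(first_key)
    print(first_key, tuple(tensor.shape))
```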
Yes, I have also noticed that. It's very annoying. ZeRO-3 was not necessary for Colab for now, but maybe it will be later. On your computer, are you using it as your default way of offloading layers (instead of --auto-devices)?
Just saying, but I made a PyTorch .bin to safetensors converter that runs locally, based on this, if anyone is interested:
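Not Silver267's actual script, but the core of such a converter fits in a few lines; a minimal local sketch with placeholder filenames:

```python
import torch
from safetensors.torch import save_file

state = torch.load("pytorch_model.bin", map_location="cpu")
# safetensors rejects shared or non-contiguous storage, so clone first.
state = {k: v.clone().contiguous() for k, v in state.items()}
save_file(state, "model.safetensors")
```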
@Silver267 I am interested, thank you for making this.
Yes, I've been using it for CPU offloading mostly. In (...)

@Silver267 - nice. In case it's useful for your project, I resharded Pygmalion using this.
@81300 Thanks for the information! Though the code doesn't seem to support RAM offload (my VRAM is 8GB), it would still be a useful reference.
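For reference, in stock DeepSpeed the offload target is just a config switch; whether the webui exposes it is a separate question. A hedged fragment pointing parameters at CPU RAM instead of NVMe:

```python
# ZeRO-3 parameter offload to CPU RAM; pin_memory speeds up
# host<->GPU copies at the cost of page-locked memory.
zero_cpu_offload = {
    "stage": 3,
    "offload_param": {"device": "cpu", "pin_memory": True},
}
```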
I am also getting the same (...). I found that libaio has issues with DeepSpeed on Arch Linux.
For some reason, when I run with deepspeed I get (...)

I'm running inside Docker on WSL2.
Since ZeRO inference is implemented and seems to be working, closing this issue. Please open another issue if there are other problems. |
This doesn't work in a multi-GPU setup, because the multiple MPI instances of server.py all try to bind to the same web port and fail.

srun --nodes=1 --cpus-per-task 16 --gres=gpu:4 --pty ./run.sh
The activation script must be sourced, otherwise the virtual environment will not work.
Setting vars
[2023-06-01 20:26:11,291] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2023-06-01 20:26:11,409] [INFO] [runner.py:541:main] cmd = /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ./server.py --deepspeed --chat --threads 24 --listen-host 0.0.0.0 --listen-port 5000 --listen --xformers --sdp-attention --trust-remote-code
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBVERSIONNCCL=2.12.12
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBROOTNCCL=/easybuild/2020/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
[2023-06-01 20:26:18,273] [INFO] [launch.py:222:main] 0 EBDEVELNCCL=/easybuild/2020/software/NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0/easybuild/NCCL-2.12.12-GCCcore-11.3.0-CUDA-11.7.0-easybuild-devel
[2023-06-01 20:26:18,273] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-01 20:26:18,273] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-01 20:26:18,273] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-01 20:26:18,273] [INFO] [launch.py:247:main] dist_world_size=4
[2023-06-01 20:26:18,273] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
WARNING:trust_remote_code is enabled. This is dangerous.
[2023-06-01 20:26:31,737] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
bin /p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading settings from settings.json...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
INFO:Loading the extension "gallery"...
Running on local URL: http://0.0.0.0:5000
To create a public link, set `share=True` in `launch()`.
ERROR: Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 161, in startup
server = await loop.create_server(
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 1505, in create_server
raise OSError(err.errno, 'error while attempting '
OSError: [Errno 98] error while attempting to bind on address ('0.0.0.0', 5000): address already in use
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 78, in serve
await self.startup(sockets=sockets)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/server.py", line 169, in startup
logger.error(exc)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1506, in error
self._log(ERROR, msg, args, **kwargs)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1624, in _log
self.handle(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1634, in handle
self.callHandlers(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
hdlr.handle(record)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/logging/__init__.py", line 968, in handle
self.emit(record)
File "/p/haicluster/llama/text-generation-webui/modules/logging_colors.py", line 96, in new
args[1].msg = color + args[1].msg + '\x1b[0m' # normal
TypeError: can only concatenate str (not "OSError") to str
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/starlette/routing.py", line 686, in lifespan
await receive()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/uvicorn/lifespan/on.py", line 139, in receive
return await self.receive_queue.get()
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/asyncio/queues.py", line 159, in get
await getter
asyncio.exceptions.CancelledError
(The same OSError bind-failure traceback, the logging TypeError, and the resulting "Exception in thread Thread-1 (run)" repeat, partially interleaved, for each of the remaining ranks.)
^C[2023-06-01 20:27:02,022] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
(Interleaved KeyboardInterrupt tracebacks from the four server.py ranks follow, each dying inside gradio's networking.start_server.)
Traceback (most recent call last):
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/bin/deepspeed", line 6, in <module>
[2023-06-01 20:27:02,123] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350800
main()
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 556, in main
result.wait()
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1204, in wait
return self._wait(timeout=timeout)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1938, in _wait
[2023-06-01 20:27:02,133] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350801
(pid, sts) = self._try_wait(0)
File "/easybuild/2020/software/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/subprocess.py", line 1896, in _try_wait
(pid, sts) = os.waitpid(self.pid, wait_flags)
File "/p/haicluster/llama/text-generation-webui/sc_venv_template/venv/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 548, in sigkill_handler
result_kill = subprocess.Popen(kill_cmd, env=env)
NameError: free variable 'kill_cmd' referenced before assignment in enclosing scope
[2023-06-01 20:27:02,195] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350802
[2023-06-01 20:27:02,256] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 350803
[2023-06-01 20:27:02,315] [INFO] [launch.py:437:sigkill_handler] Main process received SIGTERM, exiting
srun: error: haicluster3: task 0: Exited with exit code 1
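A hedged sketch of a possible workaround (not the repo's actual fix): gate the Gradio launch on the local rank, which the DeepSpeed launcher exports as LOCAL_RANK for each spawned process. The function and parameter names here are hypothetical:

```python
import os

def maybe_launch_ui(interface):
    """Let only local rank 0 bind the web port; the other DeepSpeed
    ranks would otherwise race for it and crash as shown above."""
    if int(os.getenv("LOCAL_RANK", "0")) == 0:
        interface.launch(prevent_thread_lock=True,
                         server_name="0.0.0.0", server_port=5000)
    # Non-zero ranks fall through: they keep their ZeRO partition and
    # join the NCCL collectives during generation, but serve no HTTP.
```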
Has anyone run tests of DeepSpeed with an Intel AMX-capable CPU (4th-gen Xeon, Sapphire Rapids)?