Add generation server scripts using HF accelerate and DS-inference #328

Merged (33 commits) on Sep 1, 2022
107 changes: 107 additions & 0 deletions scripts/bloom-inference-server/README.md
@@ -0,0 +1,107 @@
## Inference solutions for BLOOM 176B
We support HuggingFace accelerate and DeepSpeed Inference for generation.

Install required packages:

```shell
pip install fastapi uvicorn accelerate "huggingface_hub>=0.9.0"
```
To install [DeepSpeed](https://github.com/microsoft/DeepSpeed) (set `TORCH_CUDA_ARCH_LIST` in the command below to match your GPU architecture, e.g. `8.0` for A100):
```shell
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
```
To install [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII):
```shell
git clone https://github.com/microsoft/DeepSpeed-MII
cd DeepSpeed-MII
pip install .
```

All the provided scripts were tested with BLOOM 176B on 8x A100 80GB GPUs; they might not work for other models or a different number of GPUs.
DS inference only supports fp16 for the CLI and server applications, but the benchmark script supports both fp16 and bf16. bf16 support for the CLI and server will be added once DeepSpeed provides suitable CUDA kernels.

DS inference is deployed using the DeepSpeed MII library, which requires checkpoints resharded for 8-way tensor parallelism. The HuggingFace checkpoints can be resharded and cached using the following command:
```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path <PATH TO DS CACHED MODEL>
```
Note: Running the above script will consume ~350 GB of disk space and will take some time (~30 minutes), depending on both the speed of your GPUs and storage.
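
For reference, the resharding itself goes through DeepSpeed's inference engine. A minimal sketch of a manual pre-shard-and-save step (this is not the PR's `cache_ds_checkpoints.py`; it assumes `deepspeed.init_inference` accepts `save_mp_checkpoint_path` and a `checkpoints.json` index of the original HF shards, as in recent DeepSpeed releases) would look roughly like this, launched with the `deepspeed` launcher (`deepspeed --num_gpus 8 <script>.py`):
```python
# Minimal sketch, not the script used by this PR: pre-shard BLOOM for 8-way
# tensor parallelism and save the resharded weights. "checkpoints.json" and
# save_mp_checkpoint_path are assumptions based on DeepSpeed's inference API.
import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")

# instantiate on the meta device so no real weights are allocated yet
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=8,                                    # tensor-parallel degree
    dtype=torch.float16,
    checkpoint="checkpoints.json",                # index of the original HF shards
    replace_with_kernel_inject=True,
    save_mp_checkpoint_path="/path/to/ds_cache",  # resharded checkpoint is written here
)
```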

Note: sometimes GPU memory is not freed when the DS inference deployment is shut down. You can free this memory by running:
```python
import mii
mii.terminate("ds_inference_grpc_server")
```
or alternatively, by running `killall python` in a terminal.

#### BLOOM inference via command-line
This prompts for generate_kwargs every time.
Example: generate_kwargs =
```json
{"min_length": 100, "max_new_tokens": 100, "do_sample": false}
```
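
These keys are standard HuggingFace `generate()` arguments; conceptually the scripts forward them to the model's generation call. A simplified, self-contained sketch (not the PR's code, shown here on a small BLOOM variant for illustration):
```python
# Simplified sketch: generate_kwargs are ultimately forwarded to HuggingFace's
# generate(); the keys above are standard generate() arguments.
from transformers import AutoModelForCausalLM, AutoTokenizer

generate_kwargs = {"min_length": 100, "max_new_tokens": 100, "do_sample": False}

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")   # small model for illustration
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("DeepSpeed is a machine learning framework", return_tensors="pt")
outputs = model.generate(**inputs, **generate_kwargs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```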

1. using HF accelerate
```shell
python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```

2. using DS inference
```shell
python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```

**Review discussion on the command above:**

**Member:** I'd say make `--save_mp_checkpoint_path <PATH TO DS CACHED MODEL>` optional - why save it every time? And list it as an option below for those who want it - and also how to load it after it was saved, of course.

**@stas00 (Member), Aug 24, 2022:** have you tried running this one? I run into multiple issues here:

python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path cache --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
Traceback (most recent call last):
  File "scripts/bloom-inference-server/cli.py", line 80, in <module>
    main()
  File "scripts/bloom-inference-server/cli.py", line 44, in main
    model = DSInferenceGRPCServer(args)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/grpc_server.py", line 16, in __init__
    files = os.listdir(args.save_mp_checkpoint_path)
FileNotFoundError: [Errno 2] No such file or directory: 'cache'

If I make the dir:

mkdir cache
python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path cache --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
Traceback (most recent call last):
  File "scripts/bloom-inference-server/cli.py", line 80, in <module>
    main()
  File "scripts/bloom-inference-server/cli.py", line 44, in main
    model = DSInferenceGRPCServer(args)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/inference/Megatron-DeepSpeed/scripts/bloom-inference-server/ds_inference/grpc_server.py", line 23, in __init__
    if ("base_dir" in checkpoints_json):
UnboundLocalError: local variable 'checkpoints_json' referenced before assignment

**@mayank31398 (Collaborator, Author):** @stas00 The caching is not done every time. I think the argument name is misleading; I will change that. And yes, if the directory already exists but is empty, it leads to an error. Thanks for pointing this out.

**@mayank31398 (Collaborator, Author):** For caching, one needs to run cache_ds_checkpoints.py. Will change the argument name here.

**@mayank31398 (Collaborator, Author):** Also, the backend for this runs via DeepSpeed MII, which requires re-sharded checkpoints.

**@mayank31398 (Collaborator, Author):** LOL @stas00 I dropped the caching functionality since I figured that no one would use it, especially now that Microsoft has provided the weights themselves.

**@stas00 (Member), Aug 30, 2022:** Oh, I see, yes, we should leave it in place. Especially since eventually this will no longer be bloom-176b specific, so having the ability to generate small weights locally would be beneficial.

Apologies if I wasn't clear in the first place: I was suggesting to add support for the pre-sharded weights from the hub, not to drop the original way.

So if possible please restore those? And of course we want to document that.

Another way to approach it is to add a code example to the README on how to pre-shard and save the weights, instead of having it implemented in the code.

Whatever you feel is more intuitive to you, @mayank31398.

**@pai4451, Aug 31, 2022:**

> This has been fixed. Was a bug in DeepSpeed. Thanks @RezaYazdaniAminabadi

@mayank31398 @stas00 Thanks, I can run the ZeroQuant int8 version of BLOOM now. But the output is repetitive; has it been fixed by now?

in=DeepSpeed is a machine learning framework
out=DeeSpeed is a machine learning framework for deep deep deep deep deep deep deep deep deep deep deep…(repeated)

My generate_kwargs are max_new_tokens=100, do_sample=False.
I installed the latest DeepSpeed from the master branch and the versions are:

deepspeed==0.7.3+afdc7287
transformers==4.20.1
accelerate==0.12.0

@mayank31398 What changes did you make to fix this bug? Thanks in advance.

**@mayank31398 (Collaborator, Author):** No, are you on the latest commit (master branch) of DeepSpeed @pai4451? Can you try a different input sentence?

#### BLOOM server deployment
1. using HF accelerate
```shell
python scripts/bloom-inference-server/server.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --host <HOST ADDRESS> --port <PORT> --allowed_max_new_tokens 100
```

2. using DS inference
```shell
python scripts/bloom-inference-server/server.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --host <HOST ADDRESS> --port <PORT> --allowed_max_new_tokens 100
```

An example [script](examples/server_request.py) for querying the BLOOM server is provided. To run it:
```shell
python scripts/bloom-inference-server/examples/server_request.py --host <HOST ADDRESS> --port <PORT>
```
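
If you want to query the server without the example script, a request along the following lines should work. Note that the `/generate/` route and the payload keys shown here are assumptions for illustration; the exact interface is defined in `server.py` and `examples/server_request.py`:
```python
# Hedged sketch of a manual client request. The "/generate/" route and the
# payload keys are assumptions for illustration; check server.py and
# examples/server_request.py for the exact interface.
import requests

url = "http://<HOST ADDRESS>:<PORT>/generate/"
payload = {
    "text": ["DeepSpeed is a machine learning framework"],
    "min_length": 50,
    "max_new_tokens": 50,
    "do_sample": False,
}

response = requests.post(url, json=payload)
print(response.json())
```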

#### Benchmark system for BLOOM inference
1. using HF accelerate
```shell
python scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5
```

2. using DS inference
```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --benchmark_cycles 5
```

3. using DS ZeRO
```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
```

Alternatively, the following shell script will benchmark different batch sizes for the model.
```shell
mkdir -p logs

for bs in {1,2,4,8,16,32,64,128}
do
python scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/hf-$bs.log

deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/ds-$bs.log

deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/ds-zero-$bs.log
done
```
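
The throughput and latency numbers from such a sweep can then be collected from the logs, for example with a small helper that keys off the `Throughput` and `Latency` lines printed by `benchmark.py`:
```python
# Collect the throughput/latency lines from the benchmark logs produced above.
import glob

for path in sorted(glob.glob("logs/*.log")):
    with open(path) as f:
        for line in f:
            if "Throughput" in line or "Latency" in line:
                print(path, line.strip())
```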

The following will benchmark sequence length for batch size = 1 on DS inference.
```shell
for sq in {1,10,50,100,200,300,400,500,600,700,800,900,1000,1500,2000,2500,3000,3500,4000,4500,5000}
do
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --batch_size 1 --benchmark_cycles 5 --deployment_framework ds_inference --generate_kwargs '{"do_sample": false, "min_length": '$sq', "max_new_tokens": '$sq'}' 2>&1 | tee logs/ds_$sq.log
done
```
159 changes: 159 additions & 0 deletions scripts/bloom-inference-server/benchmark.py
@@ -0,0 +1,159 @@
import argparse
import gc
import os

import deepspeed
import torch

import utils
from ds_inference import DSInferenceModel
from ds_zero import DSZeROModel
from hf_accelerate import HFAccelerateModel
from utils import (
    BENCHMARK,
    DS_INFERENCE,
    DS_ZERO,
    HF_ACCELERATE,
    GenerateRequest,
    Model,
    get_argument_parser,
    get_dummy_batch,
    parse_generate_kwargs,
    print_rank_n,
    run_and_log_time
)


def benchmark_generation(model: Model,
                         request: GenerateRequest,
                         cycles: int = 5):
    total_new_tokens_generated = 0
    for _ in range(cycles):
        response = model.generate(request)
        total_new_tokens_generated += sum(
            new_tokens for new_tokens in response.num_generated_tokens)
    return total_new_tokens_generated


def get_benchmark_results(benchmark_time: float,
                          initialization_time: float,
                          total_new_tokens_generated: int,
                          batch_size: int,
                          cycles: int) -> str:
    throughput = total_new_tokens_generated / benchmark_time
    latency = benchmark_time / cycles
    return f"""
*** Performance stats:
Throughput (including tokenization) = {throughput:.2f} tokens/sec
Throughput (including tokenization) = {1000 / throughput:.2f} msecs/token
Model loading time = {initialization_time:.2f} secs
Total tokens generated = {total_new_tokens_generated} with batch size = {batch_size}
Latency = {latency:.2f} secs
Model loading time + generation time per batch = {initialization_time + latency:.2f} secs
"""


def benchmark_end_to_end(args: argparse.Namespace,
                         model_class: Model,
                         zero_activated: bool = False) -> None:
    model, initialization_time = run_and_log_time(
        (model_class, {"args": args})
    )

    request = parse_generate_kwargs(
        get_dummy_batch(args.batch_size),
        args.generate_kwargs
    )

    print_rank_n(f"generate_kwargs = {args.generate_kwargs}")
    print_rank_n(f"batch_size = {args.batch_size}")

    # warmup is a must if measuring speed as it's when all the optimizations are performed
    # e.g. on 8x80 a100 the first pass of 100 tokens takes 23sec, and the next one is 4secs
    response = model.generate(request)

    for i, (o, _) in zip(request.text, zip(response.text, response.num_generated_tokens)):
        print_rank_n(f"{'-' * 60}\nin = {i}\nout = {o}\n")

    if (args.benchmark_cycles > 0):
        print_rank_n(f"*** Running benchmark")

        torch.cuda.empty_cache()
        gc.collect()

        # warm up
        model.generate(request)
        torch.cuda.synchronize()

        # benchmark
        total_new_tokens_generated, benchmark_time = run_and_log_time(
            (
                benchmark_generation,
                {
                    "model": model,
                    "request": request,
                    "cycles": args.benchmark_cycles
                }
            )
        )

        # with ZeRO every GPU is generating batch_size * sequence_length tokens
        if (zero_activated):
            world_size = int(os.getenv('WORLD_SIZE', '1'))
            total_new_tokens_generated *= world_size

        print_rank_n(
            get_benchmark_results(
                benchmark_time,
                initialization_time,
                total_new_tokens_generated,
                args.batch_size,
                args.benchmark_cycles
            )
        )


def get_args() -> argparse.Namespace:
    parser = get_argument_parser()

    group = parser.add_argument_group(title="launch config")
    group.add_argument("--benchmark_cycles", type=int,
                       default=0, help="additionally run benchmark")
    group.add_argument("--local_rank", required=False,
                       type=int, help="used by dist launchers")
    group.add_argument("--batch_size", default=1, type=int, help="batch size")
    group.add_argument("--cpu_offload", action="store_true",
                       help="whether to activate CPU offload for DS ZeRO")

    args = utils.get_args(parser, BENCHMARK)

    launched_with_deepspeed = args.deployment_framework in [
        DS_INFERENCE, DS_ZERO]

    if (not launched_with_deepspeed):
        assert args.local_rank == None, "local_rank must be None if not launched with DeepSpeed"

    if (args.cpu_offload):
        assert args.deployment_framework == DS_ZERO, "cpu_offload only works with DS_ZeRO"

    return args


def main() -> None:
    args = get_args()

    if (args.deployment_framework == HF_ACCELERATE):
        benchmark_end_to_end(args, HFAccelerateModel)
    elif (args.deployment_framework == DS_INFERENCE):
        deepspeed.init_distributed("nccl")
        benchmark_end_to_end(args, DSInferenceModel)
    elif (args.deployment_framework == DS_ZERO):
        deepspeed.init_distributed("nccl")
        benchmark_end_to_end(args, DSZeROModel, zero_activated=True)
    else:
        raise ValueError(
            f"Unknown deployment framework {args.deployment_framework}")


if (__name__ == "__main__"):
    main()
67 changes: 67 additions & 0 deletions scripts/bloom-inference-server/cli.py
@@ -0,0 +1,67 @@
import argparse
import json
import sys

import utils
from ds_inference import DSInferenceGRPCServer
from hf_accelerate import HFAccelerateModel
from utils import CLI, DS_INFERENCE, HF_ACCELERATE, get_argument_parser, parse_generate_kwargs, print_rank_n


def get_args() -> argparse.Namespace:
    parser = get_argument_parser()

    group = parser.add_argument_group(title="launch config")
    group.add_argument("--shutdown_command", required=False,
                       type=str, default="__shutdown__", help="This string will exit the script")

    args = utils.get_args(parser, CLI)

    return args


def main() -> None:
    args = get_args()

    if (args.deployment_framework == HF_ACCELERATE):
        model = HFAccelerateModel(args)
    elif (args.deployment_framework == DS_INFERENCE):
        model = DSInferenceGRPCServer(args)
    else:
        raise ValueError(
            f"Unknown deployment framework {args.deployment_framework}")

    generate_kwargs = args.generate_kwargs

    while (True):
        try:
            input_text = input("Input text: ")

            if (input_text == args.shutdown_command):
                model.shutdown()

            if (input("change generate_kwargs? [y/n] ") == "y"):
                while (True):
                    try:
                        generate_kwargs = json.loads(
                            input("Generate kwargs: "))
                        break
                    except KeyboardInterrupt:
                        model.shutdown()
                    except Exception as e:
                        e_type, e_message, _ = sys.exc_info()
                        print("error =", e_type.__name__)
                        print("message =", e_message)
                        continue

            request = parse_generate_kwargs([input_text], generate_kwargs)
            response = model.generate(request)

            print_rank_n("Output text:", response.text[0])
            print_rank_n("Generated tokens:", response.num_generated_tokens[0])
        except KeyboardInterrupt:
            model.shutdown()


if (__name__ == "__main__"):
    main()
2 changes: 2 additions & 0 deletions scripts/bloom-inference-server/ds_inference/__init__.py
@@ -0,0 +1,2 @@
from .grpc_server import DSInferenceGRPCServer
from .model import DSInferenceModel