Followup PR for adding generation-server (#339)
* fix grpc
* use functools.partial
* fix ds-inference server
* add support for int8
* update README
* fix bugs

Co-authored-by: Mayank Mishra <mayank31398@gmail.com>
mayank31398 authored Sep 11, 2022
1 parent 479aac3 commit cd597c8
Showing 12 changed files with 127 additions and 128 deletions.
53 changes: 14 additions & 39 deletions scripts/bloom-inference-server/README.md
@@ -4,13 +4,7 @@ We support HuggingFace accelerate and DeepSpeed Inference for generation.
Install required packages:

```shell
pip install fastapi uvicorn accelerate huggingface_hub>=0.9.0
```
To install [DeepSpeed](https://github.com/microsoft/DeepSpeed):
```shell
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
CFLAGS="-I$CONDA_PREFIX/include/" LDFLAGS="-L$CONDA_PREFIX/lib/" TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
pip install fastapi uvicorn accelerate huggingface_hub>=0.9.0 deepspeed>=0.7.3
```
To install [DeepSpeed-MII](https://github.com/microsoft/DeepSpeed-MII):
```shell
@@ -19,14 +13,9 @@ cd DeepSpeed-MII
pip install .
```

All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B. These scripts might not work for other models or a different number of GPUs.
DS inference only supports fp16 for the CLI and server applications. However, for benchmarking it supports both fp16 and bf16; bf16 support for the CLI and server will be added once DeepSpeed provides suitable CUDA kernels.
All the provided scripts are tested on 8 A100 80GB GPUs for BLOOM 176B (fp16/bf16) and 4 A100 80GB GPUs for BLOOM 176B (int8). These scripts might not work for other models or a different number of GPUs.

DS inference is deployed using the DeepSpeed MII library which requires the resharded checkpoints for 8 x Tensor Parallel. The HuggingFace checkpoints can be resharded and cached using the following command:
```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path <PATH TO DS CACHED MODEL>
```
Note: Running the above script will consume ~350 GB of disk space and will take some time (~30 minutes), depending on both the speed of your GPUs and storage.
DS inference is deployed using the DeepSpeed MII library which requires the resharded checkpoints for 8 x Tensor Parallel.

Note: sometimes GPU memory is not freed when the DS inference deployment is shut down. You can free this memory by running:
@@ -35,6 +24,10 @@
```python
import mii
mii.terminate("ds_inference_grpc_server")
```
or, alternatively, by running `killall python` in a terminal.

To use quantized BLOOM, set dtype = int8. For DeepSpeed-Inference, also change model_name to microsoft/bloom-deepspeed-inference-int8; for HF accelerate, model_name does not need to change.

HF accelerate uses [LLM.int8()](https://arxiv.org/abs/2208.07339) and DS-inference uses [ZeroQuant](https://arxiv.org/abs/2206.01861) for post-training quantization.
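
For reference, a minimal sketch of what LLM.int8() loading looks like on the HF accelerate side. This is a sketch under assumptions: transformers with accelerate and bitsandbytes installed, and bigscience/bloom-7b1 used only as a smaller stand-in for bigscience/bloom.

```python
# Hedged sketch: load a BLOOM checkpoint in int8 with accelerate-style dispatch.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"  # smaller stand-in; the server scripts use bigscience/bloom

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # let accelerate place the weights on the available GPUs
    load_in_8bit=True,   # post-training quantization via LLM.int8() (needs bitsandbytes)
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```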

#### BLOOM inference via command-line
This prompts for generate_kwargs every time.
Example: generate_kwargs = `{"min_length": 100, "max_new_tokens": 100, "do_sample": false}` (the same kwargs passed to the commands below).
@@ -49,7 +42,7 @@ python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dty

2. using DS inference
```shell
python scripts/bloom-inference-server/cli.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
python scripts/bloom-inference-server/cli.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --generate_kwargs '{"min_length": 100, "max_new_tokens": 100, "do_sample": false}'
```

#### BLOOM server deployment
@@ -60,7 +53,7 @@ python scripts/bloom-inference-server/server.py --model_name bigscience/bloom --

2. using DS inference
```shell
python scripts/bloom-inference-server/server.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --host <HOST ADDRESS> --port <PORT> --allowed_max_new_tokens 100
python scripts/bloom-inference-server/server.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --host <HOST ADDRESS> --port <PORT> --allowed_max_new_tokens 100
```
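
For illustration only, a client-side sketch of querying such a server. The endpoint path and payload keys below are assumptions; the authoritative client is the examples/server_request.py script mentioned below.

```python
# Hypothetical client sketch -- the real route and payload schema are defined in
# examples/server_request.py; adjust the URL, path, and keys to match that script.
import requests

url = "http://<HOST ADDRESS>:<PORT>/generate/"   # assumed endpoint name
payload = {
    "text": ["DeepSpeed is a machine learning framework"],  # assumed payload key
    "max_new_tokens": 40,
}

response = requests.post(url, json=payload)
print(response.json())
```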

An example [script](examples/server_request.py) to query the BLOOM server is provided. To run this script:
@@ -76,32 +69,14 @@ python scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom

2. using DS inference
```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --benchmark_cycles 5
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
```

3. using DS ZeRO
Alternatively, to load the model faster:
```shell
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name microsoft/bloom-deepspeed-inference-fp16 --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5
```

Alternatively, the following shell script will benchmark different batch sizes for the model.
```shell
mkdir -p logs

for bs in {1,2,4,8,16,32,64,128}
do
python scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/hf-$bs.log

deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --deployment_framework ds_inference --save_mp_checkpoint_path <PATH TO DS CACHED MODEL> --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/ds-$bs.log

deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5 --batch_size $bs 2>&1 | tee logs/ds-zero-$bs.log
done
```

The following will benchmark sequence length for batch size = 1 on DS inference.
3. using DS ZeRO
```shell
for sq in {1,10,50,100,200,300,400,500,600,700,800,900,1000,1500,2000,2500,3000,3500,4000,4500,5000}
do
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype fp16 --batch_size 1 --benchmark_cycles 5 --deployment_framework ds_inference --generate_kwargs '{"do_sample": false, "min_length": '$sq', "max_new_tokens": '$sq'}' 2>&1 | tee logs/ds_$sq.log
done
deepspeed --num_gpus 8 scripts/bloom-inference-server/benchmark.py --model_name bigscience/bloom --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5
```
15 changes: 8 additions & 7 deletions scripts/bloom-inference-server/benchmark.py
@@ -1,6 +1,7 @@
import argparse
import gc
import os
from functools import partial

import deepspeed
import torch
@@ -57,14 +58,16 @@ def benchmark_end_to_end(args: argparse.Namespace,
model_class: Model,
zero_activated: bool = False) -> None:
model, initialization_time = run_and_log_time(
(model_class, {"args": args})
partial(model_class, args=args)
)

request = parse_generate_kwargs(
get_dummy_batch(args.batch_size),
args.generate_kwargs
)

request.preprocess()

print_rank_n(f"generate_kwargs = {args.generate_kwargs}")
print_rank_n(f"batch_size = {args.batch_size}")

@@ -87,13 +90,11 @@ def benchmark_end_to_end(args: argparse.Namespace,

# benchmark
total_new_tokens_generated, benchmark_time = run_and_log_time(
(
partial(
benchmark_generation,
{
"model": model,
"request": request,
"cycles": args.benchmark_cycles
}
model=model,
request=request,
cycles=args.benchmark_cycles
)
)
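
The substantive change in this hunk is passing a functools.partial instead of a (callable, kwargs-dict) tuple. Below is a self-contained sketch of that calling convention, with simplified stand-ins for the repo's run_and_log_time and benchmark_generation helpers.

```python
import time
from functools import partial

def run_and_log_time(func):
    # Simplified stand-in for the utils helper: run a zero-argument callable,
    # return (result, elapsed seconds).
    start = time.time()
    result = func()
    return result, time.time() - start

def benchmark_generation(model, request, cycles=5):
    # Simplified stand-in benchmark loop: "generate" repeatedly and count tokens.
    total_new_tokens = 0
    for _ in range(cycles):
        total_new_tokens += model(request)
    return total_new_tokens

dummy_model = lambda request: len(request.split())  # pretend every word is a generated token

total_new_tokens_generated, benchmark_time = run_and_log_time(
    partial(benchmark_generation, model=dummy_model, request="hello from the benchmark", cycles=3)
)
print(total_new_tokens_generated, f"{benchmark_time:.6f}s")
```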

3 changes: 3 additions & 0 deletions scripts/bloom-inference-server/cli.py
@@ -55,6 +55,9 @@ def main() -> None:
continue

request = parse_generate_kwargs([input_text], generate_kwargs)

request.preprocess()

response = model.generate(request)

print_rank_n("Output text:", response.text[0])
59 changes: 28 additions & 31 deletions scripts/bloom-inference-server/ds_inference/grpc_server.py
@@ -6,59 +6,59 @@
from transformers import AutoTokenizer

import mii
from utils import GenerateRequest, GenerateResponse, Model, get_filter_dict, get_str_dtype, print_rank_n
from utils import (
GenerateRequest,
GenerateResponse,
Model,
get_downloaded_model_path,
get_filter_dict,
get_str_dtype,
print_rank_n
)


class DSInferenceGRPCServer(Model):
def __init__(self, args: argparse.Namespace) -> None:
self.deployment_name = "ds_inference_grpc_server"

files = os.listdir(args.save_mp_checkpoint_path)
for file in files:
if (file.endswith(".json")):
checkpoints_json = json.load(
open(os.path.join(args.save_mp_checkpoint_path, file), "r"))
break
downloaded_model_path = get_downloaded_model_path(args.model_name)

if ("base_dir" in checkpoints_json):
del checkpoints_json["base_dir"]
self.tokenizer = AutoTokenizer.from_pretrained(downloaded_model_path)
self.pad = self.tokenizer.pad_token_id

if (args.dtype in [torch.float16, torch.int8]):
checkpoints_json = os.path.join(
downloaded_model_path, "ds_inference_config.json")

if (args.dtype == torch.float16):
mii.deploy(
task="text-generation",
model=args.model_name,
# should pass args.model_name but can't since the new
# weights are not supported yet. So, this is a hack
model="bigscience/bloom",
deployment_name=self.deployment_name,
model_path=downloaded_model_path,
mii_config={
"dtype": get_str_dtype(args.dtype),
"tensor_parallel": 8,
"port_number": 50950,
"checkpoint_dict": checkpoints_json
},
model_path=args.save_mp_checkpoint_path
"checkpoint_dict": json.load(open(checkpoints_json, "r"))
}
)
else:
raise NotImplementedError("This is not yet supported")
elif (args.dtype == torch.bfloat16):
raise NotImplementedError("bfloat16 is not yet supported")

self.tokenizer = AutoTokenizer.from_pretrained(args.model_name)
self.pad = self.tokenizer.pad_token_id
self.model = mii.mii_query_handle(self.deployment_name)

def generate(self, request: GenerateRequest) -> GenerateResponse:
text = request.text

return_type = type(text)
if (return_type == str):
text = [text]

output_text = self.model.query(
{"query": text},
{"query": request.text},
**get_filter_dict(request)
).response

output_text = [_ for _ in output_text]

# Remove input from output
input_tokens = self.tokenizer(text).input_ids
input_tokens = self.tokenizer(request.text).input_ids
output_tokens = self.tokenizer(output_text).input_ids

input_token_lengths = [len(x) for x in input_tokens]
@@ -72,10 +72,6 @@ def generate(self, request: GenerateRequest) -> GenerateResponse:
output_text = self.tokenizer.batch_decode(
output_tokens, skip_special_tokens=True)

if (return_type == str):
output_text = output_text[0]
num_generated_tokens = num_generated_tokens[0]

return GenerateResponse(
text=output_text,
num_generated_tokens=num_generated_tokens
@@ -87,4 +83,5 @@ def shutdown(self) -> None:
try:
mii.terminate(self.deployment_name)
except Exception:
exit()
pass
exit()
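
A stand-alone sketch of the prompt-trimming logic in generate() above: MII returns prompt plus continuation, so the prompt tokens are cut off before decoding. bigscience/bloom-560m is used here only as a small stand-in tokenizer, and the texts are invented for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # small stand-in

prompts = ["DeepSpeed is", "BLOOM is"]
# Pretend these are the MII responses (prompt + continuation).
generated = [
    "DeepSpeed is a deep learning optimization library",
    "BLOOM is an open multilingual language model",
]

input_tokens = tokenizer(prompts).input_ids
output_tokens = tokenizer(generated).input_ids

input_token_lengths = [len(x) for x in input_tokens]
output_token_lengths = [len(x) for x in output_tokens]
num_generated_tokens = [o - i for i, o in zip(input_token_lengths, output_token_lengths)]

# keep only the newly generated tail of every sequence
output_tokens = [x[-n:] if n > 0 else [] for x, n in zip(output_tokens, num_generated_tokens)]

print(tokenizer.batch_decode(output_tokens, skip_special_tokens=True))
print(num_generated_tokens)
```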
14 changes: 5 additions & 9 deletions scripts/bloom-inference-server/ds_inference/model.py
@@ -3,9 +3,11 @@
import json
import os
from argparse import Namespace
from functools import partial

import deepspeed
import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from utils import Model, get_downloaded_model_path, print_rank_n, run_rank_n
@@ -58,6 +60,7 @@ def __init__(self, args: Namespace) -> None:
self.input_device = torch.cuda.current_device()

print_rank_n("Model loaded")
dist.barrier()


class TemporaryCheckpointsJSON:
@@ -77,17 +80,10 @@ def write_checkpoints_json(self, model_path: str) -> None:

def __enter__(self):
run_rank_n(
os.makedirs,
{
"name": self.tmp_directory,
"exist_ok": True
}
partial(os.makedirs, name=self.tmp_directory, exist_ok=True)
)
run_rank_n(
self.write_checkpoints_json,
{
"model_path": self.model_path
},
partial(self.write_checkpoints_json, model_path=self.model_path),
barrier=True
)
return self.tmp_file
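
run_rank_n now also receives a functools.partial. Below is a rough sketch of what such a helper can look like; this is an assumption about its behaviour, and the repo's utils implementation is authoritative.

```python
import os
from functools import partial

import torch.distributed as dist

def run_rank_n(func, barrier=False, rank=0):
    # Run a zero-argument callable on a single rank; optionally sync all ranks afterwards.
    result = None
    if (not dist.is_initialized()) or dist.get_rank() == rank:
        result = func()
    if barrier and dist.is_initialized():
        dist.barrier()
    return result

# Works unchanged in a single-process run, where torch.distributed is not initialized.
run_rank_n(partial(os.makedirs, name="tmp_checkpoints", exist_ok=True))
```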
2 changes: 2 additions & 0 deletions scripts/bloom-inference-server/ds_zero/model.py
@@ -3,6 +3,7 @@

import deepspeed
import torch
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from transformers.deepspeed import HfDeepSpeedConfig

@@ -64,3 +65,4 @@ def __init__(self, args: Namespace) -> None:
self.input_device = torch.cuda.current_device()

print_rank_n("Model loaded")
dist.barrier()
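
The dist.barrier() added here (and in ds_inference/model.py) makes every rank wait until all ranks have finished loading before serving requests. A minimal illustration of the pattern, assuming the process group is initialized by the deepspeed launcher:

```python
import torch.distributed as dist

def load_and_sync(load_fn):
    model = load_fn()      # each rank loads / shards its part of the model
    if dist.is_initialized():
        dist.barrier()     # no rank proceeds until every rank is ready
    return model

# Single-process fallback: with no initialized process group the barrier is skipped.
print(load_and_sync(lambda: "dummy-model"))
```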
45 changes: 32 additions & 13 deletions scripts/bloom-inference-server/hf_accelerate/model.py
@@ -15,13 +15,20 @@ def __init__(self, args: Namespace) -> None:
self.tokenizer = AutoTokenizer.from_pretrained(downloaded_model_path)
self.pad = self.tokenizer.pad_token_id

self.model = AutoModelForCausalLM.from_pretrained(
downloaded_model_path,
device_map="auto",
max_memory=get_max_memory_per_gpu_dict(
args.dtype, args.model_name),
torch_dtype=args.dtype
)
kwargs = {
"pretrained_model_name_or_path": downloaded_model_path,
"device_map": "auto",
"max_memory": get_max_memory_per_gpu_dict(
args.dtype,
args.model_name
)
}
if (args.dtype == torch.int8):
kwargs["load_in_8bit"] = True
else:
kwargs["torch_dtype"] = args.dtype

self.model = AutoModelForCausalLM.from_pretrained(**kwargs)

self.model.requires_grad_(False)
self.model.eval()
@@ -39,14 +46,20 @@ def get_max_memory_per_gpu_dict(dtype, model_name):
if model_name == "bigscience/bloom" and n_gpus == 8 and torch.cuda.get_device_properties(0).total_memory > 79*2**30:
# hand crafted optimized memory map for 8x80 setup over BLOOM
# this works with bs=40
return {0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB', 4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}

if (dtype in [torch.bfloat16, torch.float16]):
max_memory_per_gpu = {0: '0GIB', 1: '51GIB', 2: '51GIB', 3: '51GIB',
4: '51GIB', 5: '51GIB', 6: '51GIB', 7: '51GIB'}
elif (dtype == torch.int8):
max_memory_per_gpu = {0: '0GIB', 1: '26GIB', 2: '26GIB', 3: '26GIB',
4: '26GIB', 5: '26GIB', 6: '26GIB', 7: '26GIB'}
print_rank_n("Max memory per gpu:", max_memory_per_gpu)
return max_memory_per_gpu
try:
# model_params calculation, as we don't have a model yet to do:
#model_params = sum(dict((p.data_ptr(), p.numel()) for p in model.parameters()).values())

config = AutoConfig.from_pretrained(model_name)
h = config.n_embed
h = config.hidden_size
l = config.n_layer
v = config.vocab_size
# from https://github.com/bigscience-workshop/bigscience/tree/6917a3b5fefcf439d3485ca184b4d9f6ab605150/math#model-sizing
@@ -56,11 +69,14 @@ def get_max_memory_per_gpu_dict(dtype, model_name):
f"The model {model_name} has a broken config file. Please notify the owner")
raise

bytes = torch.finfo(dtype).bits / 8
if (dtype == torch.int8):
bytes = 1
else:
bytes = torch.finfo(dtype).bits / 8
param_memory_total_in_bytes = model_params * bytes
# add 5% since weight sizes aren't the same and some GPU may need more memory
param_memory_per_gpu_in_bytes = int(
param_memory_total_in_bytes / n_gpus * 1.05)
param_memory_total_in_bytes / n_gpus * 1.10)
print_rank_n(
f"Estimating {param_memory_per_gpu_in_bytes/2**30:0.2f}GB per gpu for weights")

@@ -72,4 +88,7 @@ def get_max_memory_per_gpu_dict(dtype, model_name):
raise ValueError(
f"Unable to generate the memory map automatically as the needed estimated memory per gpu ({param_memory_per_gpu_in_bytes/2**30:0.2f}GB) is bigger than the available per gpu memory ({max_memory_per_gpu_in_bytes/2**30:0.2f}GB)")

return {i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
max_memory_per_gpu = {
i: param_memory_per_gpu_in_bytes for i in range(torch.cuda.device_count())}
print("Max memory per gpu:", max_memory_per_gpu)
return max_memory_per_gpu
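
As a sanity check on the estimate printed above, a back-of-the-envelope computation assuming roughly 176.2B parameters for bigscience/bloom, an 8-GPU split, and the 10% safety margin used in the code:

```python
# Rough per-GPU weight-memory estimate for BLOOM-176B, mirroring the dtype branches above.
model_params = 176_247_271_424            # approximate parameter count of bigscience/bloom

for dtype_name, bytes_per_param in [("float16/bfloat16", 2), ("int8", 1)]:
    param_memory_total_in_bytes = model_params * bytes_per_param
    param_memory_per_gpu_in_bytes = param_memory_total_in_bytes / 8 * 1.10
    print(
        f"{dtype_name}: ~{param_memory_total_in_bytes / 2**30:.0f} GiB of weights in total, "
        f"~{param_memory_per_gpu_in_bytes / 2**30:.1f} GiB per GPU"
    )
```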