Ds-inference Int8 support through ZeroQuant technology #2217

Merged: 25 commits merged into master from ds-inference/ZeroQuant-Int8 on Aug 30, 2022
Commits (25)
cf2fe01
Fix the layer-past for GPT based models
Aug 8, 2022
c2cf304
add the Int8 support for ds-inference using ZeroQuant technology
Aug 13, 2022
d98f1f9
fixing some issue with loading checkpoint and bias-add
Aug 15, 2022
ebc82bb
adding the logic to store/restore scale for INT8 checkpoint
Aug 15, 2022
43a7023
add empty quantization scale for different models to run with fp16
Aug 15, 2022
00aa188
Empty-Commit
Aug 15, 2022
9bed645
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 15, 2022
84e0d03
fix several issues after merging with master
Aug 18, 2022
f6cb028
several fixes for generating the INT8 sharded checkpoint
Aug 19, 2022
d47bea6
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 19, 2022
cb72d9c
move quantizer declaration before inference branch
Aug 20, 2022
32b9322
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 24, 2022
57779ef
fixing some part to catch up with latest update on HF side
Aug 24, 2022
f4e48e6
Merge branch 'ds-inference/ZeroQuant-Int8' of github.com:microsoft/De…
Aug 24, 2022
dbcb6ec
reducing the CPU memory usage when loading checkpoint (this solves th…
Aug 25, 2022
cd80ecc
some minor modification to the ckpt names
Aug 25, 2022
82a37d6
remove masking and some configuration changes
Aug 26, 2022
9d12656
remove dead code
Aug 26, 2022
4ae356e
Merge branch 'master' into ds-inference/ZeroQuant-Int8
jeffra Aug 26, 2022
d7ff364
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 28, 2022
b17a3b5
fix some issue with int8 ckpt-loading
Aug 28, 2022
a541e52
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 29, 2022
2845bad
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 30, 2022
c77f5e0
Merge branch 'master' into ds-inference/ZeroQuant-Int8
RezaYazdaniAminabadi Aug 30, 2022
f3f4b1d
change the mp_size to tp_size at inference config & add some doc-stri…
Aug 30, 2022
14 changes: 13 additions & 1 deletion deepspeed/__init__.py
@@ -279,7 +279,19 @@ def init_inference(model,
of groups used in quantization. A tuple is passed in if we want to specify extra grouping
for the MLP part of a Transformer layer (e.g. (True, 8) means we quantize the model using 8 groups for
the whole network except the MLP part, for which we use 8 extra groups).
- replace_with_kernel_inject: If set we inject kernel as we initialize the inference-engine
+ replace_with_kernel_inject: this flag needs to be set to true to inject inference kernels for models such as Bert, GPT2, GPT-Neo and GPT-J. Otherwise,
+ the injection_dict provides the names of two linear layers as a tuple: (attention_output projection, transformer output projection)
+ return_tuple: Specify whether the transformer layers need to return a tuple or a Tensor. It is set to True by default (returning a tuple).
+ ep_size: The expert-parallelism size, which is used for partitioning the experts across the GPUs in the expert-parallel group.
+ moe: Specify whether the Transformer is an MoE model. It is set to False by default.
+ moe_experts: The global number of experts used in an MoE layer.
+ moe_type: Specify the type of MoE layer. We have two types of MoE layer: 'Standard' and 'Residual'. It is set to 'Standard' by default.
+ args: All the arguments used for launching the inference API that may be useful to the inference engine when injecting the optimizations.
+ enable_cuda_graph: use this flag to capture the CUDA graph of the inference ops, so that they can run faster via graph replay.
+ It is set to False by default.
+ save_mp_checkpoint_path: The path to which the loaded model is saved as a checkpoint. This feature is used for adjusting the
+ parallelism degree to help reduce the model-loading overhead. No new checkpoint is saved if no path is passed.
+ base_dir: The root directory under which all the checkpoint files exist. This can also be passed through the JSON config.

Returns:
A deepspeed.InferenceEngine wrapped model.
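The docstring above describes the arguments of deepspeed.init_inference. Below is a minimal usage sketch, assuming a Hugging Face GPT-2 model and a CUDA device; the keyword names follow this docstring, but the exact signature can vary across DeepSpeed releases, so treat it as illustrative rather than canonical.

```python
# Minimal sketch of deepspeed.init_inference with kernel injection, assuming a
# Hugging Face GPT-2 model and a CUDA device. Keyword names follow the docstring
# above; passing dtype=torch.int8 (typically together with a ZeroQuant INT8
# checkpoint) selects the INT8 path added by this PR.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree
    dtype=torch.float16,              # use torch.int8 for the ZeroQuant INT8 path
    replace_with_kernel_inject=True,  # inject fused kernels (Bert/GPT2/GPT-Neo/GPT-J)
)

inputs = tokenizer("DeepSpeed inference with kernel injection", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```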
5 changes: 3 additions & 2 deletions deepspeed/module_inject/replace_module.py
@@ -848,7 +848,8 @@ def replace_fn(child, _policy, layer_id=0):
checkpoint = checkpoint_dict['checkpoints']
ckpt_list = checkpoint["tp"] if type(checkpoint) is dict else checkpoint
ckpt_type = checkpoint_dict.get('parallelization', 'pp')
- ckpt_mp_size = checkpoint_dict.get('mp_size', len(ckpt_list))
+ ckpt_mp_size = checkpoint_dict.get('tp_size', len(ckpt_list))
+ ckpt_mp_size = checkpoint_dict.get('mp_size', ckpt_mp_size)
base_dir1 = checkpoint_dict.get('base_dir', base_dir)

if ckpt_type == 'pp' and type(checkpoint) is list:
@@ -969,7 +970,7 @@ def replace_fn(child, _policy, layer_id=0):
1.0,
'parallelization':
'tp',
- 'mp_size':
+ 'tp_size':
world_size,
'dtype':
'int8' if quantize else ('float16' if fp16 else 'float32')
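The keys read and written in the hunks above belong to the DS-inference checkpoint metadata. A hypothetical example of that metadata is sketched below; the shard file names and directory are placeholders, and older configs that only carry 'mp_size' remain valid because the 'tp_size' lookup falls back to it.

```python
# Hypothetical checkpoint metadata using the keys shown in the diff above.
# Shard file names and base_dir are placeholders, not part of the PR.
import json

ckpt_meta = {
    "checkpoints": {"tp": ["tp_00.pt", "tp_01.pt"]},  # one shard per tensor-parallel rank
    "parallelization": "tp",
    "tp_size": 2,                      # key written by this PR (previously "mp_size")
    "dtype": "int8",                   # ZeroQuant INT8 shards; "float16"/"float32" otherwise
    "base_dir": "/data/ds_int8_ckpt",  # root directory containing the shard files
}

with open("ds_inference_config.json", "w") as f:
    json.dump(ckpt_meta, f, indent=2)
```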