
Question: how to inference Int8 models (GPT) supported through ZeroQuant technology? #2301

Closed
xk503775229 opened this issue Sep 7, 2022 · 10 comments
Labels: enhancement (New feature or request), inference

@xk503775229

I just used DeepSpeed ZeroQuant to compress my model, but I don't know how to use DeepSpeed to run inference with it. Is there any documentation describing this?

@xk503775229 (Author)

Is there any guide to running inference on compressed models (especially ZeroQuant)?
Any help would be appreciated.

@xk503775229 xk503775229 changed the title from "Question: how to inference Int8 models supported through ZeroQuant technology?" to "Question: how to inference Int8 models (GPT) supported through ZeroQuant technology?" Sep 7, 2022
@mayank31398 (Contributor) commented Sep 7, 2022

@xk503775229 bigscience-workshop/Megatron-DeepSpeed#339
This PR adds support for BLOOM ds-inference with fp16 and int8.
The README is not up-to-date yet. I will work on fixing that.

@xk503775229 (Author)

@mayank31398 When I use the BLOOM approach to load my checkpoint, I get the following error:
[screenshot]

GPT2 checkpoint type is not supported

I was trying out the compression library for ZeroQuant quantization (for a GPT-2 model). While I was able to compress the model, I didn't see any throughput/latency gain from the quantization during inference.
Any help would be appreciated.

@mayank31398 (Contributor)

Can you share the code snippet you used for loading GPT?
Also, DS-inference currently uses special fp16 CUDA kernels for inference, which is not yet the case for int8. The int8 CUDA kernels, which are much faster than the fp16 ones, will be released later.

@xk503775229 (Author)

Many thanks
The following is my code snippet used for loading GPT.
[screenshot of the loading code]

checkpoint.json:

{"type": "GPT2", "checkpoints": ["/root/DeepSpeedExamples/model_compression/gpt2/output/W8A8/pytorch_model.bin"], "version": 1.0}

@mayank31398 (Contributor) commented Sep 16, 2022

In general, the checkpoint-loading code is only supposed to work with Megatron checkpoints, but there is an exception for BLOOM. I am not sure about the reason; @jeffra, can you comment?
This is what I see in the DeepSpeed code:

def get_sd_loader_json(json_file, checkpoint_engine):
    if isinstance(json_file, str):
        with open(json_file) as f:
            data = json.load(f)
    else:
        assert isinstance(json_file, dict)
        data = json_file

    sd_type = data['type']
    ckpt_list = data['checkpoints']
    version = data['version']
    ckpt_type = data.get('parallelization', 'pp')
    mp_size = data.get('mp_size', 0)
    if 'bloom' in sd_type.lower():
        return data
    return SDLoaderFactory.get_sd_loader(ckpt_list,
                                         checkpoint_engine,
                                         sd_type,
                                         version)

@staticmethod
def get_sd_loader(ckpt_list, checkpoint_engine, sd_type='Megatron', version=None):
    if sd_type == 'Megatron':
        return MegatronSDLoader(ckpt_list, version, checkpoint_engine)
    else:
        assert False, '{} checkpoint type is not supported'.format(sd_type)

@RezaYazdaniAminabadi (Contributor)

Hi @xk503775229,

Thanks for your interest in trying int8 for other models. In general you should be able to do so; however, the issue here is that you want to use it together with checkpoint loading, which is not currently supported for all models. Regarding int8 inference, have you tried calling init_inference and simply passing int8 without a checkpoint JSON (i.e., letting the model be loaded as it originally was in fp16)?
Thanks,
Reza
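For reference, a minimal sketch of that suggestion (hypothetical GPT-2 model; the model is loaded the usual way and no checkpoint JSON is passed; exact kwargs depend on the DeepSpeed version):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load the model normally (fp16 weights), then request int8 from DeepSpeed-Inference
# without passing a checkpoint JSON.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16)
model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.int8,
    replace_with_kernel_inject=True,
)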

@david-macleod

Can you share a code snippet you used for loading GPT? Also, currently, DS-inference uses fp16 special CUDA kernels for inference which is not the case for int8. int8 CUDA kernels will be released later which are much faster than fp16.

Hi, is there a timeline for the release of the int8 CUDA kernels?

@awan-10 awan-10 added the enhancement (New feature or request) label Oct 28, 2022
@yaozhewei (Contributor)

Hi, this will be released later as a part of MII (MII-Azure): https://github.com/microsoft/DeepSpeed-MII
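As a rough sketch of how a model could be served through MII once that release lands (deployment name is made up; dtype and the query payload follow the DeepSpeed-MII README at the time and should be treated as assumptions):

import mii

# Deploy a text-generation model locally through MII (hypothetical deployment name).
mii.deploy(task="text-generation",
           model="gpt2",
           deployment_name="gpt2_deployment",
           mii_config={"dtype": "fp16"})

# Query the running deployment.
generator = mii.mii_query_handle("gpt2_deployment")
result = generator.query({"query": ["DeepSpeed is"]})
print(result)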

@yaozhewei (Contributor)

Closing for now. Please re-open if you need further assistance.
