
Fix random token-generation issue + MP-checkpoint loading/saving #2132

Merged (29 commits) on Jul 29, 2022

Conversation

RezaYazdaniAminabadi
Contributor

This PR fixes the token-generation issue caused by different random seeds across the MP ranks. It also adds the ability to save/load MP-partitioned checkpoints to speed up checkpoint loading for inference.

cc: @stas00 @jeffra
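
For anyone trying this out, here is a minimal sketch of how the two new paths can be used, based on the init_inference calls shown later in this thread (the model name and paths are placeholders, not the exact PR test):

import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")  # placeholder model

# First run: let DeepSpeed partition the checkpoint across the MP ranks and
# save the per-rank shards (plus a config json) under save_mp_checkpoint_path.
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=True,
                                 save_mp_checkpoint_path="/path/to/DS_cache")

# Later runs: point `checkpoint` at the generated config json so each rank
# loads its own pre-partitioned shard instead of re-partitioning the full model.
# model = deepspeed.init_inference(model,
#                                  mp_size=world_size,
#                                  dtype=torch.float16,
#                                  replace_with_kernel_inject=True,
#                                  checkpoint="/path/to/DS_cache/BLOOM-176B_ds-inference_config.json")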

@RezaYazdaniAminabadi
Contributor Author

I'm experiencing this error when using this branch with bloom:

Note: replace_with_kernel_inject is True

TypeError: get_sd_loader_json() missing 1 required positional argument: 'checkpoint_engine'

Traceback (most recent call last):
  File "bloom-ds-inference.py", line 373, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 289, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 136, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 326, in _apply_injection_policy
    checkpoint, ckpt_type, ckpt_name, ckpt_mp_size = SDLoaderFactory.get_sd_loader_json(

Just fixed it, please give it a try

@zcrypt0

zcrypt0 commented Jul 28, 2022

@RezaYazdaniAminabadi

Looks like I need to add a new key to my checkpoint.json?

Is it mandatory? What value should I put in it for the Hugging Face checkpoint file list?

EDIT: I looked at the code and set it to pp, which got me past the error (a sketch of the change follows the traceback below).

KeyError: 'parallelization'
Traceback (most recent call last):
  File "bloom-ds-inference.py", line 374, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 289, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 136, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 332, in _apply_injection_policy
    replace_transformer_layer(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 789, in replace_transformer_layer
    ckpt_type = checkpoint_dict['parallelization']
KeyError: 'parallelization'
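
For reference, getting past this just means adding the new key to the existing Hugging Face checkpoint list json, roughly as sketched below (the shard list is a placeholder; the complete file is shown in the next comment):

{
  "type": "BLOOM-176B",
  "checkpoints": ["... original Hugging Face shard files ..."],
  "version": 1.0,
  "parallelization": "pp"
}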

@zcrypt0

zcrypt0 commented Jul 28, 2022

After getting past the parallelization key error, I saved the TP checkpoints successfully, but hit this error when trying to load them (the new checkpoints json file is passed in correctly):

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cache/deepspeed/bigscience/bloom/BLOOM-176B-non-tp.pt'

The -tp_XX.pt files all exist in the directory as expected, but this -non-tp.pt file doesn't exist there.

My new checkpoints json file looks like this:

{"type": "BLOOM-176B",
"base_dir": "/home/ubuntu/.cache/deepspeed/bigscience/bloom", 
"checkpoints": ["BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-tp_00.pt", "BLOOM-176B-tp_01.pt", "BLOOM-176B-tp_02.pt", "BLOOM-176B-tp_03.pt", "BLOOM-176B-tp_04.pt", "BLOOM-176B-tp_05.pt", "BLOOM-176B-tp_06.pt", "BLOOM-176B-tp_07.pt"], 
"version": 1.0, 
"parallelization": "tp", 
"mp_size": 8}

EDIT: I found the -non-tp.pt file in my home directory rather than the cache directory, but surprisingly it is only 879 bytes. I copied it to the cache directory, and now I get this error when I run:

Loading 2 checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "bloom-ds-inference.py", line 388, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 289, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 136, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 332, in _apply_injection_policy
    replace_transformer_layer(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 816, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 162, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 157, in load_module_recursive
    load_module_recursive(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 155, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 27, in load
    module.weight = mp_replace.copy(module.weight.data,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 92, in copy
    dst.data.copy_(src)
NotImplementedError: Cannot copy out of meta tensor; no data!

(Review comment on this excerpt from the diff, the save call for the non-transformer parameters in replace_module.py:)

v in dict(replaced_module.state_dict()).items()
if transformer_name not in k
}),
non_tp_ckpt_name)
@zcrypt0 zcrypt0 Jul 28, 2022

f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}'
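
(In other words, assuming the excerpt above is the torch.save of the non-transformer parameters, the suggested fix would look roughly like the sketch below; replaced_module, transformer_name, save_mp_checkpoint_path and non_tp_ckpt_name come from the surrounding DeepSpeed code:)

from collections import OrderedDict

import torch

# Write the non-TP checkpoint into the user-provided save directory instead of
# the current working directory.
torch.save(
    OrderedDict({k: v
                 for k, v in dict(replaced_module.state_dict()).items()
                 if transformer_name not in k}),
    f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}')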

Contributor Author

That's true, it's not saved correctly. I'm going to fix it now.

Contributor Author

Okay, I just fixed it. You should see the following files and sizes under the save_path:
[screenshot: the saved checkpoint files and their sizes]

Contributor Author

@zcrypt0, it also generates a config file under the same path that you can use to run inference with.

@zcrypt0 zcrypt0 Jul 29, 2022

EDIT: I just noticed the change in the non-tp file size; I will give it a try soon.

@zcrypt0

@RezaYazdaniAminabadi Just tested and it works without a hitch, nice! 👍

@mayank31398
Contributor

mayank31398 commented Jul 29, 2022

Still getting this error @RezaYazdaniAminabadi
I am on master branch now

llm-test-cluster-9:1727013:1729609 [4] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

running with batch size = 1

@mayank31398
Contributor

Still getting this error @RezaYazdaniAminabadi I am on master branch now

[NCCL ncclUnhandledCudaError traceback and "running with batch size = 1" note quoted verbatim from the previous comment]

Ran this with CUDA 11.6 and DeepSpeed on the master branch. This resolved the issue.

@mayank31398
Contributor

mayank31398 commented Aug 1, 2022

@RezaYazdaniAminabadi how much time is cached TP model loading supposed to take?
It shows "Loading 2 shards" for me, and after the progress bar completes it seems to be stuck.
This is how I am using it:

checkpoints_json = os.path.join(
    args.mp_cached_model_path, "BLOOM-176B_ds-inference_config.json")

self.model = deepspeed.init_inference(
    self.model,
    mp_size=world_size,
    dtype=args.dtype,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=True
)

self.model is loaded using HF AutoModel as in bloom-ds-inference.py
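
(For context, a minimal sketch of how self.model can be built before that call, following the same meta-device pattern as the script further down this thread; the model name is a placeholder:)

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model structure on the meta device without materializing weights;
# the real weights come from the cached TP shards that init_inference loads.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained("bigscience/bloom"))
model = model.eval()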

@mayank31398
Contributor

Never mind. I switched to an NVMe drive and loading took 2 minutes, versus 12 minutes on a hard drive. I didn't know it could make that much of a difference.

@mayank31398
Contributor

@RezaYazdaniAminabadi I am seeing

NotImplementedError: Cannot copy out of meta tensor; no data!

again after updating to master branch and saving without providing checkpoint json

@jeffra
Collaborator

jeffra commented Aug 21, 2022

@RezaYazdaniAminabadi I am seeing

NotImplementedError: Cannot copy out of meta tensor; no data!

again after updating to master branch and saving without providing checkpoint json

Just want to double check, did your install include this commit in master? #2237

@mayank31398
Contributor

mayank31398 commented Aug 21, 2022

@jeffra yes I am on the latest commit

@mayank31398
Contributor

mayank31398 commented Aug 21, 2022

I use this code, which I run with:

deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path ../DS_cache

import argparse
import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()

    group = parser.add_argument_group(title="launch config")
    group.add_argument("--local_rank", required=False,
                       type=int, help="used by dist launchers")
    group.add_argument("--save_mp_checkpoint_path", required=True,
                       type=str, help="MP checkpoints path for DS inference")

    group = parser.add_argument_group(title="model")
    group.add_argument("--model_name", type=str,
                       required=True, help="model to use")
    group.add_argument("--dtype", type=str, required=True,
                       choices=["bf16", "fp16"], help="dtype for model")

    args = parser.parse_args()

    if (args.dtype == "bf16"):
        args.dtype = torch.bfloat16
    elif (args.dtype == "fp16"):
        args.dtype = torch.float16

    return args


def main() -> None:
    args = get_args()

    if (args.local_rank == 0):
        print("Loading model...")
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    # Load model
    with deepspeed.OnDevice(dtype=args.dtype, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(args.model_name),
            torch_dtype=torch.bfloat16
        )
    model = model.eval()

    if (args.dtype == torch.float16):
        model = deepspeed.init_inference(
            model,
            mp_size=world_size,
            dtype=args.dtype,
            replace_with_kernel_inject=True,
            save_mp_checkpoint_path=args.save_mp_checkpoint_path
        )
    elif (args.dtype == torch.bfloat16):
        raise NotImplementedError("bfloat16 is not yet supported")

    print("Model loaded")


if (__name__ == "__main__"):
    main()

@mayank31398
Contributor

mayank31398 commented Aug 22, 2022

@jeffra This issue is blocking bigscience-workshop/Megatron-DeepSpeed#328

@mayank31398
Contributor

@jeffra ^^

@pai4451

pai4451 commented Aug 30, 2022

@RezaYazdaniAminabadi I am seeing

NotImplementedError: Cannot copy out of meta tensor; no data!

again after updating to master branch and saving without providing checkpoint json

@mayank31398 Hi, I am also facing this issue. Has it been solved yet?

@mayank31398
Contributor

@pai4451 not yet
