
Fix random token-generation issue + MP-checkpoint loading/saving #2132

Merged (29 commits) on Jul 29, 2022

Conversation

RezaYazdaniAminabadi
Contributor

This PR fixes the token-generation issue caused by different random seeds across the MP ranks. It also adds the ability to save/load MP-partitioned checkpoints to speed up checkpoint loading for inference.

cc: @stas00 @jeffra
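
For anyone trying this out, here is a minimal sketch of how the two new paths can be used, based on the init_inference calls shown later in this thread (the model name and paths are placeholders, not the exact PR test):

import os

import deepspeed
import torch
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom")  # placeholder model

# First run: let DeepSpeed partition the checkpoint across the MP ranks and
# save the per-rank shards (plus a config json) under save_mp_checkpoint_path.
model = deepspeed.init_inference(model,
                                 mp_size=world_size,
                                 dtype=torch.float16,
                                 replace_with_kernel_inject=True,
                                 save_mp_checkpoint_path="/path/to/DS_cache")

# Later runs: point `checkpoint` at the generated config json so each rank
# loads its own pre-partitioned shard instead of re-partitioning the full model.
# model = deepspeed.init_inference(model,
#                                  mp_size=world_size,
#                                  dtype=torch.float16,
#                                  replace_with_kernel_inject=True,
#                                  checkpoint="/path/to/DS_cache/BLOOM-176B_ds-inference_config.json")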

@RezaYazdaniAminabadi
Contributor Author

I'm experiencing this error when using this branch with bloom:

Note: replace_with_kernel_inject is True

TypeError: get_sd_loader_json() missing 1 required positional argument: 'checkpoint_engine'

Traceback (most recent call last):
  File "bloom-ds-inference.py", line 373, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 289, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 136, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 326, in _apply_injection_policy
    checkpoint, ckpt_type, ckpt_name, ckpt_mp_size = SDLoaderFactory.get_sd_loader_json(

Just fixed it, please give it a try

@zcrypt0

zcrypt0 commented Jul 28, 2022

@RezaYazdaniAminabadi

Looks like I need to add a new key to my checkpoint.json?

Is it mandatory? What value should I put in it for the Hugging Face checkpoint file list?

EDIT: I looked at the code and set it to pp, which got me past the error (a sketch of the change follows the traceback below).

KeyError: 'parallelization'
Traceback (most recent call last):
  File "bloom-ds-inference.py", line 374, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 289, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 136, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 332, in _apply_injection_policy
    replace_transformer_layer(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 789, in replace_transformer_layer
    ckpt_type = checkpoint_dict['parallelization']
KeyError: 'parallelization'
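
For reference, getting past this just means adding the new key to the existing Hugging Face checkpoint list json, roughly as sketched below (the shard list is a placeholder; the complete file is shown in the next comment):

{
  "type": "BLOOM-176B",
  "checkpoints": ["... original Hugging Face shard files ..."],
  "version": 1.0,
  "parallelization": "pp"
}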

@zcrypt0

zcrypt0 commented Jul 28, 2022

After getting past the parallelization key error, I saved the TP checkpoints successfully, but hit this error when trying to load them (the new checkpoints json file is passed in correctly):

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cache/deepspeed/bigscience/bloom/BLOOM-176B-non-tp.pt'

The -tp_XX.pt files all exist in the directory as expected, but this -non-tp.pt file doesn't exist there.

My new checkpoints json file looks like this:

{"type": "BLOOM-176B",
"base_dir": "/home/ubuntu/.cache/deepspeed/bigscience/bloom", 
"checkpoints": ["BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-non-tp.pt", "BLOOM-176B-tp_00.pt", "BLOOM-176B-tp_01.pt", "BLOOM-176B-tp_02.pt", "BLOOM-176B-tp_03.pt", "BLOOM-176B-tp_04.pt", "BLOOM-176B-tp_05.pt", "BLOOM-176B-tp_06.pt", "BLOOM-176B-tp_07.pt"], 
"version": 1.0, 
"parallelization": "tp", 
"mp_size": 8}

EDIT: I found the -non-tp.pt file in my home directory rather than the cache directory, but surprisingly it is only 879 bytes. I copied it to the cache directory, and now I get this error when I run:

Loading 2 checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "bloom-ds-inference.py", line 388, in <module>
    model = deepspeed.init_inference(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 289, in init_inference
    engine = InferenceEngine(model,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 136, in __init__
    self._apply_injection_policy(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 332, in _apply_injection_policy
    replace_transformer_layer(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 816, in replace_transformer_layer
    load_model_with_checkpoint(replaced_module,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 162, in load_model_with_checkpoint
    load_module_recursive(r_module)
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 157, in load_module_recursive
    load_module_recursive(
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 155, in load_module_recursive
    layer_policies[child.__class__](child, prefix + name + '.')
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/load_checkpoint.py", line 27, in load
    module.weight = mp_replace.copy(module.weight.data,
  File "/home/ubuntu/venv/lib/python3.8/site-packages/deepspeed/module_inject/replace_module.py", line 92, in copy
    dst.data.copy_(src)
NotImplementedError: Cannot copy out of meta tensor; no data!

(Review comment on this excerpt from the diff, the save call for the non-transformer parameters in replace_module.py:)

v in dict(replaced_module.state_dict()).items()
if transformer_name not in k
}),
non_tp_ckpt_name)
@zcrypt0 zcrypt0 Jul 28, 2022

f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}'
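
(In other words, assuming the excerpt above is the torch.save of the non-transformer parameters, the suggested fix would look roughly like the sketch below; replaced_module, transformer_name, save_mp_checkpoint_path and non_tp_ckpt_name come from the surrounding DeepSpeed code:)

from collections import OrderedDict

import torch

# Write the non-TP checkpoint into the user-provided save directory instead of
# the current working directory.
torch.save(
    OrderedDict({k: v
                 for k, v in dict(replaced_module.state_dict()).items()
                 if transformer_name not in k}),
    f'{save_mp_checkpoint_path}/{non_tp_ckpt_name}')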

Contributor Author

That's true, it's not saved correctly. I'm going to fix it now.

Contributor Author

Okay, I just fixed it. You should see the following files and sizes under the save_path:
[screenshot: the saved checkpoint files and their sizes]

Contributor Author

@zcrypt0, it also generates a config file under the same path that you can use to run inference with.

@zcrypt0 zcrypt0 Jul 29, 2022

EDIT: I just noticed the change in the non-tp file size; I will give it a try soon.

@zcrypt0

@RezaYazdaniAminabadi Just tested and it works without a hitch, nice! 👍

@mayank31398
Contributor

mayank31398 commented Jul 29, 2022

Still getting this error @RezaYazdaniAminabadi
I am on master branch now

llm-test-cluster-9:1727013:1729609 [4] include/alloc.h:50 NCCL WARN Cuda failure 'an illegal memory access was encountered'
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO channel.cc:20 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO init.cc:373 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO init.cc:774 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO init.cc:904 -> 1
llm-test-cluster-9:1727013:1729609 [4] NCCL INFO group.cc:72 -> 1 [Async thread]
Traceback (most recent call last):
  File "scripts/inference/bloom-ds-inference.py", line 257, in <module>
    _ = generate()
  File "scripts/inference/bloom-ds-inference.py", line 244, in generate
    outputs = model.generate(**input_tokens, **generate_kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1288, in generate
    return self.greedy_search(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/generation_utils.py", line 1683, in greedy_search
    outputs = self(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 508, in forward
    outputs = self.model_orig_fwd(*inputs, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 919, in forward
    transformer_outputs = self.transformer(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 806, in forward
    outputs = block(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 831, in forward
    self.attention(input,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 543, in forward
    output = DeepSpeedSelfAttentionFunction.apply(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/transformer_inference.py", line 466, in forward
    dist.all_reduce(output, group=mp_group)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 312, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 49, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor,
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1287, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1639180588308/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.

running with batch size = 1

@mayank31398
Contributor

Still getting this error @RezaYazdaniAminabadi I am on master branch now

[NCCL ncclUnhandledCudaError traceback and "running with batch size = 1" note quoted verbatim from the previous comment]

Ran this with CUDA 11.6 and DeepSpeed on the master branch. This resolved the issue.

@mayank31398
Contributor

mayank31398 commented Aug 1, 2022

@RezaYazdaniAminabadi how much time is cached TP model loading supposed to take?
It shows "Loading 2 shards" for me, and after the progress bar completes it seems to be stuck.
This is how I am using it:

checkpoints_json = os.path.join(
    args.mp_cached_model_path, "BLOOM-176B_ds-inference_config.json")

self.model = deepspeed.init_inference(
    self.model,
    mp_size=world_size,
    dtype=args.dtype,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=True
)

self.model is loaded using HF AutoModel as in bloom-ds-inference.py
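
(For context, a minimal sketch of how self.model can be built before that call, following the same meta-device pattern as the script further down this thread; the model name is a placeholder:)

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model structure on the meta device without materializing weights;
# the real weights come from the cached TP shards that init_inference loads.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained("bigscience/bloom"))
model = model.eval()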

@mayank31398
Contributor

Never mind. I switched to an NVMe drive and loading took 2 minutes, versus 12 minutes on a hard drive. I didn't know it could make that much of a difference.

@mayank31398
Contributor

@RezaYazdaniAminabadi I am seeing

NotImplementedError: Cannot copy out of meta tensor; no data!

again after updating to master branch and saving without providing checkpoint json

@jeffra
Collaborator

jeffra commented Aug 21, 2022

@RezaYazdaniAminabadi I am seeing

NotImplementedError: Cannot copy out of meta tensor; no data!

again after updating to master branch and saving without providing checkpoint json

Just want to double check, did your install include this commit in master? #2237

@mayank31398
Contributor

mayank31398 commented Aug 21, 2022

@jeffra yes I am on the latest commit

@mayank31398
Contributor

mayank31398 commented Aug 21, 2022

I use this code, which I run with:

deepspeed --num_gpus 8 scripts/bloom-inference-server/cache_ds_checkpoints.py --model_name bigscience/bloom --dtype fp16 --save_mp_checkpoint_path ../DS_cache

import argparse
import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForCausalLM


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()

    group = parser.add_argument_group(title="launch config")
    group.add_argument("--local_rank", required=False,
                       type=int, help="used by dist launchers")
    group.add_argument("--save_mp_checkpoint_path", required=True,
                       type=str, help="MP checkpoints path for DS inference")

    group = parser.add_argument_group(title="model")
    group.add_argument("--model_name", type=str,
                       required=True, help="model to use")
    group.add_argument("--dtype", type=str, required=True,
                       choices=["bf16", "fp16"], help="dtype for model")

    args = parser.parse_args()

    if (args.dtype == "bf16"):
        args.dtype = torch.bfloat16
    elif (args.dtype == "fp16"):
        args.dtype = torch.float16

    return args


def main() -> None:
    args = get_args()

    if (args.local_rank == 0):
        print("Loading model...")
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    # Load model
    with deepspeed.OnDevice(dtype=args.dtype, device="meta"):
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(args.model_name),
            torch_dtype=torch.bfloat16
        )
    model = model.eval()

    if (args.dtype == torch.float16):
        model = deepspeed.init_inference(
            model,
            mp_size=world_size,
            dtype=args.dtype,
            replace_with_kernel_inject=True,
            save_mp_checkpoint_path=args.save_mp_checkpoint_path
        )
    elif (args.dtype == torch.bfloat16):
        raise NotImplementedError("bfloat16 is not yet supported")

    print("Model loaded")


if (__name__ == "__main__"):
    main()

@mayank31398
Contributor

mayank31398 commented Aug 22, 2022

@jeffra This issue is blocking bigscience-workshop/Megatron-DeepSpeed#328

@mayank31398
Contributor

@jeffra ^^

@pai4451

pai4451 commented Aug 30, 2022

@RezaYazdaniAminabadi I am seeing

NotImplementedError: Cannot copy out of meta tensor; no data!

again after updating to master branch and saving without providing checkpoint json

@mayank31398 Hi, I am also facing this issue. Has it been solved yet?

@mayank31398
Contributor

@pai4451 not yet
