
Accelerator abstraction #2320

Closed
wants to merge 21 commits

Conversation

tjruwase
Contributor

@tjruwase commented Sep 13, 2022

This PR is to help reduce the burden of supporting Deep Learning accelerators in DeepSpeed. We expect at least two concrete benefits from it:

  1. Adding and maintaining accelerator logic will require minimal changes to DeepSpeed code.
  2. DeepSpeed development will involve minimal changes to accelerator code.

This PR is heavily influenced by #2221. The main differences from #2221 are:

  1. Introduction of an abstract DeepSpeedAccelerator class that encapsulates accelerator logic.
  2. A single global DeepSpeedAccelerator object that is lazily initialized based on available torch modules (e.g., torch.cuda, torch.xpu) and is used throughout DeepSpeed code to access accelerator functionality (a minimal sketch follows this list).
  3. Concrete implementations of DeepSpeedAccelerator (e.g., for CUDA or XPU) are external modules that DeepSpeed imports during initialization.
  4. Elimination of control flow related to accelerator selection (e.g., if ... else) outside of initialization code.
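
As a minimal sketch of points 1 and 2 (the module names and the exact method set here are hypothetical placeholders, not the PR's actual interface):

import abc

class DeepSpeedAccelerator(abc.ABC):
    """Abstract base class encapsulating accelerator logic (point 1)."""

    @abc.abstractmethod
    def device_name(self, device_index=None):
        """Return a literal device string such as 'cuda' or 'cuda:1'."""

    @abc.abstractmethod
    def current_device(self):
        """Return the integer index of the current device."""

_accelerator = None

def get_accelerator():
    """Lazily construct the single global accelerator object (point 2)."""
    global _accelerator
    if _accelerator is None:
        import torch
        if hasattr(torch, 'xpu'):
            from xpu_accelerator import XPU_Accelerator  # hypothetical external module (point 3)
            _accelerator = XPU_Accelerator()
        else:
            from cuda_accelerator import CUDA_Accelerator  # hypothetical external module
            _accelerator = CUDA_Accelerator()
    return _accelerator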

This PR includes a proposed implementation of a CUDA accelerator. Below is the output of a basic test on V100x32GB:
[screenshot: basic test output]

@delock
Collaborator

delock commented Sep 28, 2022

For device abstraction, one way is to make literal_device() an interface of the accelerator. During porting I found there is still a need to use a literal string to identify the device in many places.

def literal_device(self, device_index=None):
    if device_index is None:
        return 'cuda'
    return 'cuda:{}'.format(device_index)

https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/activation_checkpointing/checkpointing.py#L557
-->
cuda_device = get_accelerator().literal_device(get_accelerator().current_device())

It might also make sense to have current_literal_device() for the sake of code readability.

We are testing accelerator abstraction interfaces in our environment. When we finish we will push to #2221. We look forward to gradually converging with #2320.

@tjruwase
Contributor Author

> For device abstraction, one way is to make literal_device() an interface of the accelerator. During porting I found there is still a need to use a literal string to identify the device in many places.

In general, it is alright to extend the interface if you find compelling cases in your testing. However, I think the literal_device() scenario below could be handled differently as I will explain in a separate post.

> We are testing accelerator abstraction interfaces in our environment. When we finish we will push to #2221. We look forward to gradually converging with #2320.

Thanks, that would be great. The primary goal of this PR is to experiment and foster our alignment. So, it is great if your #2221 can subsume it since you are able to test the XPU accelerator.

@tjruwase
Contributor Author

> def literal_device(self, device_index=None):
>     if device_index is None:
>         return 'cuda'
>     return 'cuda:{}'.format(device_index)
>
> https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/activation_checkpointing/checkpointing.py#L557 --> cuda_device = get_accelerator().literal_device(get_accelerator().current_device())

I think in this case the DeepSpeed code should be changed so that Stream() is called without a device argument, letting it default to None. This way, each concrete accelerator can handle it as appropriate. For CUDA, this will be the current device, but perhaps some other accelerator may want to do things differently. What do you think?
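
As a small sketch of that suggestion: torch.cuda.Stream already treats an omitted device as the current one, so the call site needs no literal device string at all.

import torch

# Stream(device=None) creates the stream on the current CUDA device.
stream = torch.cuda.Stream()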

> It might also make sense to have current_literal_device() for the sake of code readability.

Can you please share a compelling case for this? I think that integer indices are sufficient to identify devices; perhaps I am missing some scenarios that you have encountered, so please correct me. My main goal is to minimize the API surface as much as possible to simplify development and testing. As an alternative, would it be sufficient to have an API that returns a human-readable name for a device index? For example, the CUDA implementation of such a method could be:

def readable_device_name(self, device_index=None):
    return f'cuda:{torch.cuda.current_device()}' if device_index is None else f'cuda:{device_index}'

@delock
Collaborator

delock commented Sep 29, 2022

The necessity of a literal device string or readable device name comes from the fact that device names are used in PyTorch tensor functions for accelerators other than CUDA.

For example, if we have a tensor T, then T.to(1) copies T to CUDA device 1. If we want to copy it to an XPU device, we need to write T.to('xpu:1'). We can also specify a CUDA device explicitly with a literal device string: T.to('cuda:1').

PyTorch also uses the plain device name 'cuda' to refer to the current CUDA device; in that case, the device name is 'cuda' or 'xpu' rather than None.

In the following function, which takes cuda_device as a parameter, tensor interfaces are used. That is why we have to replace device_index with a device name wherever a device id is used in a tensor interface.

def copy_to_device(item, device, criterion_func):

There is a choice of when to convert from device_index to device name. In #2221, we chose the device name as the default form, and pass device names instead of device indices in DeepSpeed function parameters.

Alternatively, if device indices are passed in all DeepSpeed functions, then conversion to a device name is needed wherever PyTorch tensor interfaces are called, or any other place where a device name is required.
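
A minimal sketch of that second option, using the get_accelerator()/device_name() names settled on later in this thread (to_device is a hypothetical helper, not an existing DeepSpeed API):

from deepspeed.accelerator import get_accelerator

def to_device(tensor, device_index):
    # Convert the integer index into a literal device string ('cuda:1',
    # 'xpu:1', ...) only at the point where the tensor interface needs it.
    return tensor.to(get_accelerator().device_name(device_index))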

@delock
Collaborator

delock commented Sep 29, 2022


In the DeepSpeed code, https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/activation_checkpointing/checkpointing.py#L115 is the only place where current_device() can be used as-is for other accelerators. In all other places where current_device is used, an alternative such as current_device_name() should be used to return a device name.

@tjruwase
Contributor Author

> The necessity of a literal device string or readable device name comes from the fact that device names are used in PyTorch tensor functions for accelerators other than CUDA.

Thanks for the explanation; I am convinced. I had incorrectly assumed that accelerator.device() could be used in the tensor interfaces in place of torch.device. But now I find that torch.cuda.device is not a subclass of torch.device:

>>> isinstance(torch.cuda.device(0), torch.device)
False
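
For reference, torch.cuda.device is a context manager for selecting the active device rather than a device descriptor, which is why it cannot stand in for torch.device in tensor interfaces:

import torch

# Context manager: selects the active device for the enclosed code.
with torch.cuda.device(0):
    t = torch.empty(2, device='cuda')  # lands on device 0

# Descriptor: what tensor interfaces actually accept.
u = torch.empty(2, device=torch.device('cuda', 0))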

Hopefully in the future the PyTorch tensor interface can drop the 'cuda' hardcoding to simplify things. For now, I have added device_name(), which returns the literal string. Will that work? Please feel free to modify as needed. Thanks!

deepspeed/accelerator/abstract_accelerator.py
...

@abc.abstractmethod
def range_pop(self, msg):
Collaborator

range_pop does not take msg as an argument; only range_push does.
https://pytorch.org/docs/stable/generated/torch.cuda.nvtx.range_pop.html
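
A sketch of the corrected declarations, mirroring the torch.cuda.nvtx signatures (a fragment of the abstract class, not the PR's full interface):

@abc.abstractmethod
def range_push(self, msg):
    # maps to torch.cuda.nvtx.range_push(msg)
    ...

@abc.abstractmethod
def range_pop(self):
    # maps to torch.cuda.nvtx.range_pop(), which takes no arguments
    ...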

assert isinstance(accel_obj, DeepSpeedAccelerator), \
f'{accel_obj.__class__.__name__} accelerator is not subclass of DeepSpeedAccelerator'

assert accel_obj.is_available(), \
Collaborator

Our internal testing shows that this is_available() check breaks a unit test. The call to is_available() initializes CUDA too early and causes an initialization error.

Specifically, this test is broken by this assertion:
pytest -k "test_ckpt_arg_none" test_activation_checkpointing.py

Contributor Author

Can you please share a stack trace for this? Thanks!

Collaborator

Here is the stack trace of this error in our CUDA environment:

$ pytest -k "test_ckpt_arg_none" test_activation_checkpointing.py
......
Worker 0 exited with code 1
----------------------------- Captured stdout call -----------------------------
[2022-10-10 10:27:14,784] [INFO] [comm.py:639:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
------------------------------------------ Captured stderr call ------------------
Process Process-1:
Traceback (most recent call last):
  File "/home/gma/anaconda3/envs/ds/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/gma/anaconda3/envs/ds/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gma/mingzhil/CI/frameworks.ai.benchmarking.other.deepspeed/tests/unit/common.py", line 250, in dist_init
    dist.barrier()
  File "/home/gma/mll/CI/frameworks.ai.benchmarking.other.deepspeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/gma/mll/CI/frameworks.ai.benchmarking.other.deepspeed/deepspeed/comm/comm.py", line 459, in barrier
    return cdb.barrier()
  File "/home/gma/mll/CI/frameworks.ai.benchmarking.other.deepspeed/deepspeed/comm/torch.py", line 153, in barrier
    return torch.distributed.barrier()
  File "/home/gma/anaconda3/envs/ds/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2776, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
......
===================== short test summary info ==========================
FAILED test_activation_checkpointing.py::test_ckpt_arg_none
=================== 1 failed, 21 deselected, 2 warnings in 1.06s ========================
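
The pattern in the trace is the usual CUDA-plus-fork failure: probing the device in the parent pytest process initializes CUDA, and the forked test worker cannot re-initialize it. A hedged sketch of the direction the fix took (the commit log later in this thread includes "turn off CUDA device validation in real_accelerator"); the function body here is illustrative, not the exact DeepSpeed code:

def set_accelerator(accel_obj):
    assert isinstance(accel_obj, DeepSpeedAccelerator), \
        f'{accel_obj.__class__.__name__} accelerator is not subclass of DeepSpeedAccelerator'
    # Deliberately no accel_obj.is_available() assertion here: probing the
    # device would initialize CUDA in the parent process, and forked test
    # workers would then fail with 'CUDA error: initialization error'.
    global _accelerator
    _accelerator = accel_obj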

@tjruwase
Contributor Author

@delock, since you have merged this into your PR, I feel this one has already achieved its goal of helping us align on core design issues. So, I propose to abandon this and focus on yours. What do you think?

@delock
Collaborator

delock commented Oct 12, 2022

@tjruwase
Sure, let me merge your latest interface changes, then we can co-develop.

delock added a commit to delock/DeepSpeedSYCLSupport that referenced this pull request Oct 12, 2022
* [downstream] merge from upstream including microsoft#2320

* [xpu] add xpu to real_accelerator

* [accelerator] enable xpu_accelerator

* [accelerator abstraction] use accelerator abstraction interface in microsoft#2320

* [accelerator abstraction] use current_device_name() inplace of current_device()

* [literal_device] use literal_device interface in XPU_Accelerator class

* use get_accelerator().device(index) instead of torch.device(get_accelerator().literal_device(index))

* [accelerator abstraction] convert tests according to XPU_Accelerator interface

* change back incorrect convertion of torch.device(get_accelerator().literal_device())

* align convertion code with original form

* [xpu] add stream to interface

* change 'literal_device' to 'device_name'

* add on_accelerator interface to XPU_Accelerator

* update __init__.py for end of file issue

* add device_name in cuda accelerator

* remove literal_device in cuda_accelerator

* use device_name instead of literal_device in benchmarks

* turn off CUDA device validation in real_accelerator

* don't pass device_index in get_rng_state if it is None

* fix range_pop interface

* remove msg argument from range_pop

* fix cuda_accelerator implementation of is_fp16_supported
@delock
Collaborator

delock commented Oct 13, 2022

@tjruwase I found one interface semantic difference between #2221 and #2320.

In #2320, the default behavior of device_name() is to return the current device name, and name() is used to get the device name without an index:
device_name() --> current device name, such as 'cuda:0'
name() --> device name without index, such as 'cuda'

In #2221, the default behavior of device_name() is to return the device name without an index, and current_device_name() returns the current device name:
device_name() --> device name without index, such as 'cuda'
current_device_name() --> current device name, such as 'cuda:0'

I think the variant in #2221 is better, because:

  1. the original code for the current device is torch.cuda.current_device(), so current_device_name() reads as the more natural modification.
  2. 'cuda' and 'cuda:index' are both valid ways to specify a device name in PyTorch, so it is natural for them to share the same interface, device_name().

My suggestion is to remove the name() interface and keep device_name() and current_device_name().
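
A minimal sketch of the #2221 variant as a CUDA accelerator would implement it (method bodies illustrative; assumes torch is imported in the module):

def device_name(self, device_index=None):
    if device_index is None:
        return 'cuda'
    return 'cuda:{}'.format(device_index)

def current_device_name(self):
    return 'cuda:{}'.format(torch.cuda.current_device())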

@tjruwase
Contributor Author

> My suggestion is to remove the name() interface and keep device_name() and current_device_name().

This sounds good to me.

delock added a commit to delock/DeepSpeedSYCLSupport that referenced this pull request Nov 8, 2022
* Fix the layer-past for GPT based models (microsoft#2196)

* Add gradient_average flag support for sparse grads (microsoft#2188)

* Add gradient_average flag support for sparse grads

* formatting fixes

* Add tests

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Adding additional instructiosn in the compression tutorial on pre-training distillation and quantization for GPT (microsoft#2197)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Log user config exactly (microsoft#2201)

* Fix the tensor-slicing copy for qkv parameters (microsoft#2198)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor Distributed Tests (microsoft#2180)

Refactor Distributed unit tests

* fix table syntax (microsoft#2204)

Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Correctly detect offload configuration (microsoft#2208)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add cuda 11.7 (microsoft#2211)

* add cuda 11.7

* formatting

* use torch 1.9 (microsoft#2215)

* [zero-3] print warning once and support torch parameter (microsoft#2127)

* print warning only once.

* add support for torch param and only warn on gpu 0

* remove type checking. will be done on a new PR with more tests.

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Add support of OPT models (microsoft#2205)

* add opt replace policy

* simplify inf. api

* fix opt replace policy

* fix use-cash & add relu

* Add support of custom MLP act. function

* Revert "simplify inf. api"

This reverts commit 9e910fc.

* fix the inference API (temp. solution)

* fix code formatting

* add unit tests for OPT models.

* refactor pre-attention layer norm configuration

* add support of opt-350m model

* refactor the HF model config initialization

* fix hf model config issue

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* fix typos in readme. (microsoft#2218)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [device abstraction] add device abstraction to allow other device than CUDA be used

* Fix regression w. dist_init_required (microsoft#2225)

* add doc for new bert example (microsoft#2224)

* Remove the random-generator from context during inference (microsoft#2228)

* Fix the tensor-slicing copy for qkv parameters

* remove the random-generator from context during inference

* formatting

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* allow saving ckpt w/o ckpt json + bloom copy fix (microsoft#2237)

* Correctly detect zero_offload (microsoft#2213)

* Correctly detect offload configuration

* Correctly detect offload configuration

* Handle deprecated cpu offload setting

* Correcly detect zero_offload setting

* Minor tweak

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* update videos (microsoft#2249)

* Refactor dist tests: Checkpointing (microsoft#2202)

Refactor distributed tests: checkpointing

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Make OPT policy backward compatible with pre-OPT transformers versions (microsoft#2254)

* fix ds-inference without policy (microsoft#2247)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.7.2

* Enable contiguous gradients with Z1+MoE (microsoft#2250)

MoE training with zero stage 1 only works with `contiguous gradients=True`.

* [rebase-202208] additional changes needed when rebase to 202208

* [rebase] cleanup direct cuda usage after merge

* Correctly detect CPU optimizer usage (microsoft#2257)

* Correctly detect CPU optimizer usage

* Update nv-transformers-v100.yml (microsoft#2259)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [precommit] fix pre-commit issues

* Update half precision header guards (microsoft#2261)

* fix microsoft#2240: wrong time unit in flops_profiler (microsoft#2241)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.7.3

* Add blob storage to CI runners (microsoft#2260)

Add blob storage to CI runners and enable for transformers cache on inference tests

* Update replace_module.py, test-gptj.py related fix (microsoft#2269)

Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py

* Fix OrderedDict import for python3.6 (microsoft#2267)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Ds inference/fix mp2 (microsoft#2270)

* Trajepl: nebula load fix (microsoft#2182)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: chenguo <chenguo@microsoft.com>

* prevent torch ext folder mkdir at tmp (microsoft#2274)

* Ds-inference Int8 support through ZeroQuant technology (microsoft#2217)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add a new unit test for cuda ops (microsoft#2278)

Co-authored-by: cmikeh2 <connorholmes@microsoft.com>

* Add to codeowners file (microsoft#2279)

* [pin_memory] make pin_memory select device type

* Memory Access Utility (microsoft#2276)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Fp32 accuracy bug fix (microsoft#2285)

Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com>

* Refactor universal checkpointing and tensor fragments (microsoft#2253)

* Refactor universal checkpointing and tensor fragments

* Formatting

* [ds-inference] fix progress bar (microsoft#2286)

when loading the non-sharded checkpoint update the progress bar (fix by @RezaYazdaniAminabadi) - I've just tested it to work.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Offload all gradients to nvme (microsoft#2282)

* fused bias relu unittest (microsoft#2297)

* fix for pytest picking up local deepspeed dir instead of installed deepspeed (microsoft#2299)

* Fix for Zero3 when MP>1 and at least one batch param undefined (microsoft#2289)

Co-authored-by: anthony.301 <anthony.301@mri.cluster>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [downstream] merge from xpu support downstream

* Unit test for bias add kernel (microsoft#2298)

* added unit test

* Update pt_binding.cpp

* formatting

* Update test_bias_add.py

* Update relu.cu with mem_access_utils (microsoft#2306)

* Add tensor parallel inference unit tests (microsoft#2232)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>

* Fix the residual add mp scaling for  GPTNeoX (microsoft#2310)

* Add unit tests for residual_add kernels (microsoft#2307)

* add inference eval scripts (microsoft#2303)

* Upgrade P40 tests to torch 1.8 (microsoft#2316)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO-Inference blog (microsoft#2271)

* ZeRO-Inference blog

* ZeRO-Inference blog

* Format fixes

* Apply feedback

* Feedback

* Update docs/_posts/2022-08-27-zero-inference.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update docs/_posts/2022-08-27-zero-inference.md

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Address feedback

* Format fixes

* More tweaks

* long sequence, nvme offload

* Add image

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO-Inference blog - wrap up  (microsoft#2321)

* ZeRO-Inference blog - Update README (microsoft#2322)

* refactor to use mem_access (microsoft#2317)

* add quant unit test (microsoft#2315)

* add quant unit test

* add codeowner

* format fix

* fix undefined symbol: curandSetPseudoRandomGeneratorSeed

* modify ref fn name and add comment

* add comments

* add 4bit quant 16groups

* fix

* modify groups in ref code

* parameterize tensor shape

* single param

* detach tensor

* remove -lcurand flag

* add back -lcurand flag

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* only override forward if using cuda-graph (microsoft#2291)

* Add more options to inference benchmark (microsoft#2325)

* bump to 0.7.4

* MOE residual matmult unit test (microsoft#2323)

MOE residual matmul unit tests

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* [device] port cuda device to literal_device() in new tests

* MOE matmult with memaccess (microsoft#2336)

* Fix formatting

* Remove redundant variable

* Refactor residual add kernels (microsoft#2333)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* [accel_runtime] add pin_memory to accelerator runtime interface.

* mem access for quantize kernel (microsoft#2331)

* mem access for quantize kernel

* format

* format fp32

* modify quant kernel

* modify quant kernel2

* modify format

* format

* fix comments in pytest

* fix comments in pytest

* format

* rerun

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>

* increase min pre-commit versions (microsoft#2346)

* Extend scratch buffer for long prompts (microsoft#2212)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix zero docs (microsoft#2350)

* Inference profiling updates/fixes (microsoft#2348) (microsoft#2349)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Kernel Data Conversion Utility (microsoft#2327)

* Unify macro definitions and constants in a single file

* Conversion utility implementation.

* Fix reversion from formatting

* Bugfixes after testing with correct DeepSpeed

* Inline markers are available on both HIP + CUDA

* Add Onebit Optimzers in __init__ (microsoft#2340)

Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [accelerator abstraction] merge from microsoft#2320

* docs(mixture-of-experts-inference): fix typo in tuto (microsoft#2345)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* download cifar to blob storage (microsoft#2342)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Refactor gptj_residual_add kernels for better readability (microsoft#2358)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* Updated issue templates (microsoft#2363)

* Update issue templates

* fix cuda invalid config error in dequant kernel (microsoft#2362)

* format

* remove round fn

* Add missing pytest fixture scope (microsoft#2353)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Extend residual_add kernel tests to conver pre_attn_norm (microsoft#2354)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Refactor fused_bias_residual kernels for better readability (microsoft#2356)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Capture error message during sweep tests (microsoft#2351)

* Collect error messages in results.csv

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix an exception when recursively casting dicts to fp16 (microsoft#2370)

* Refactor remaining distributed tests (microsoft#2216)

* batch of refactored tests

* more test refactoring

* fp16 test refactor

* more refactors

* added DistributedFixture class

* applied DistributedFixture to first batch of tests as a trial

* added DistributedFixture test and documentation

* last tests

* fixes for refactored tests

* remove subdirs in workflow files

* fix pytest syntax error

* fix another syntax error

* update imports

* use DistFixture with elastic checkpoint test

* missing import

* update to shared class tmpdir for elastic test

* moved test files

* avoid duplicate test file name

* last refactor and moving test files

* formatting

* fix broken import

* testing forked AMD tests

* update abstract method

* use blob storage for accelerate and transformers tests

* upgrade torch for acclerate CI

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix the MLP output tensor's shape (microsoft#2380)

* allow building with latest CUDA (11.8), it is backwards compatible (microsoft#2390)

* pin transformers version for unit tests (microsoft#2402)

* Change type to tuple in replace_wo_policy isinstance check (microsoft#2387)

Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type.

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Checkpoint backwards-compatbility workaround (microsoft#2384)

* Add predicated global load (microsoft#2373)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* change call site of literal_device, on_accel_device and accel_runtime to get_accelerator() call

* add new interface definition from olruwase/accelerator_abstraction

* MII blog post (microsoft#2418)

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>

* Fix figure reference (microsoft#2419)

* [docs] update news items

* [docs] add mii repo link

* Add SLURM Multinode Runner (microsoft#2404)

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix issue with corrupted output on long generation for GPT (microsoft#2359)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* MII blog title update on Readme

* DeepSpeed-MII title change in website

* Fix GPT Neo-X multi-gpu inference (microsoft#2401)

Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* MII-Public and MII-Azure subheading in mii post

* CI fixes related to triton (microsoft#2422)

* [docs] update mii blog title (microsoft#2423)

* add SD injection policy (microsoft#2381)

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>

* [accelerator abstraction] remove name() from interface, device_name() should be used.

* merge with master (ec13da6)

* fix checkpoint loading when it is a dictionary (microsoft#2425)

* Make error regex more generic in collect_results.py (microsoft#2415)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fixes microsoft#2389 (microsoft#2411)

truncating expert param storage for checkpointing

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Fix for inference gpt-j test (microsoft#2430)

* fix for gpt-j failing due to tokenizer error

* limit number of gpt-j tokens generated due to low memory

* Fixing bug 2361 (microsoft#2410)

* fixing bug 2361

* adding pytest for config initialization

* chaning expected output to FusedAdam

* remove print statement

* running yapf on modified files

* running pre-commit formatting

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Universal checkpoint for zero stage 1 (microsoft#2284)

* Refactor universal checkpointing and tensor fragments

* Formatting

* Support zero stage1; Expand TP dim

* Remove debug prints

* Detect sharded optimizer state

* Format fixes

* Encode reshaping guide

* More symbolic constants

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* only add deps if extra is explictly called (microsoft#2432)

* Add TestInjectionPolicy inference unittest class for testing custom injection policies (microsoft#2426)

This PR adds a TestInjectionPolicy inference unittest class for testing custom injection policies.

This test differs from the existing tests in that the injection_policy dictionary is explicitly specified when calling the DeepSpeed init_inference API.

The google/t5-v1_1-small text2text-generation model and the roberta-large fill-mask model are added as tests with the injection policy explicitly specified.

This is done to expand our unittest coverage to test the path where the replace_wo_policy function is invoked (see microsoftGH-2387).

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* [memory estimators] new config args sync (microsoft#2431)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* parallelize writing of layer checkpoint files across data parallel instances (microsoft#1419)

* parallelize layer checkpoints across data parallel groups

* use partition_uniform to determine start/end index values

* formatting fix

* config: add option for parallel write of layer checkpoints in pipeline stage

* yapf fixes

* enable parallel layer write according to config param

* avoid extraneous makedir when rank 0 writes all layers

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix broken link to DeepSpeed Megatron fork (microsoft#2440)

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>

* bump to 0.7.5

* [OpBuilder] Add op builder abstraction

* convert op builder usage in merged code

* merge diff files from upstream

* [OpBuilder] add create_op_builder interface in abstract_accelerator.py

* remove files that is deleted from upstream

* [OpBuilder] add left over op builder usage in tests

* [OpBuilder] fix op builder usage in tests

* [OpBuilder] fix <op builder>.NAME usage in tests to follow op builder abstraction design

* import get_accelerator from deepspeed.accelerator directly

* [OpBuilder] remove unused function and sync with main

* add missing import

* revert changes in device.py to avoid conflict with main

* fix alexnet_model to use /tmp instead of /blob

* Mingzhi/solve pr108 b (microsoft#115)

* move ALL_OPs from __init__.py to all_Op.py to solve circular import

* delete deepspeedexamples

* fix import

* fix regression (microsoft#117)

* fix pin_memory

* fix regression

* fix error

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Mikhail Druzhinin <dipetm@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Kamal Raj <kamalraj97@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Arash Bakhtiari <arashb@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Zhihong Chen <gdst_czh@163.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: 叶志晟 <yzs981130@126.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: trajep <trajepl@gmail.com>
Co-authored-by: chenguo <chenguo@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: anthony.301 <anthony.301@mri.cluster>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Saeyeol Lee <78332687+l4d2boomer@users.noreply.github.com>
Co-authored-by: Saeyeol Lee <sylee@si-anlaytics.ai>
Co-authored-by: Jean-Louis Queguiner <jean-louis.queguiner@gadz.org>
Co-authored-by: Matt Smith <matt@mjksmith.com>
Co-authored-by: Thomas-MMJ <112830596+Thomas-MMJ@users.noreply.github.com>
Co-authored-by: lekurile <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Andrey Chernykh <andrew.chernyh@gmail.com>
Co-authored-by: Alexander Jipa <alexander.jipa@gmail.com>
Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Adam Moody <moody20@llnl.gov>
Co-authored-by: AGUL <mingzhi.liu@intel.com>
@tjruwase closed this Mar 29, 2023