Capture short kernel sequences to graph #4318

Merged: 23 commits, Dec 20, 2023

Conversation

@inkcherry (Contributor) commented Sep 13, 2023

Motivation:

  1. This covers a series of cases where short kernel sequences are launched and executed serially (no dynamic shapes), with the launch overhead being much higher than the execution overhead. We can use a graph to solve this problem. Compared to `multi-tensor-apply`, using a graph is more concise and only requires PyTorch as a dependency.
  2. Some device software stacks also support lazy-mode PyTorch, enabling full utilization of the compiler to perform graph optimization. However, in lazy mode, the operation accumulation time (host time) can become significantly higher than device time in such scenarios, and devices are usually not well utilized. By using the same API (after adding it to the accelerator, cc @delock) as CUDA graphs, this issue can also be resolved.

Change:
We modified three functions.

For `update_hp_grads`, we executed the operations for the CPU and GPU separately, because the graph is unable to record the execution of CPU operations. Additionally, the data inputs required by the graph must not have their addresses modified, or the address modification must itself be captured by the capture operation (in this case, set `replay_first_step` to `True`). Therefore, we changed `grad=None` to `grad.zero_()`. Similarly, we have also placed some inputs that require fixed addresses in the `graph_cache`.
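A minimal sketch of the fixed-address requirement (not the PR's code; the buffer names are made up): resetting a gradient buffer in place with `zero_()` keeps the storage the captured graph points at, whereas rebinding the Python name would let that storage be freed or reused.

```python
import torch

grad = torch.randn(1024, device="cuda")     # stand-in for a low-precision grad buffer
hp_grad = torch.zeros(1024, device="cuda")  # stand-in for the high-precision grad

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    hp_grad.add_(grad)  # the captured kernel reads grad's fixed address

# Rebinding (grad = None) could free the captured storage, so a later replay
# would read memory the graph no longer owns. Instead, reset and refill the
# same buffer in place:
grad.zero_()
grad.copy_(torch.randn(1024, device="cuda"))
g.replay()  # accumulates the new gradient into hp_grad
```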

For `clip_tensors_by_global_norm`, `clip_coef` is a scalar with a non-fixed value, so it needs to be moved to the GPU when using a graph.
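As a hedged illustration (not the actual `clip_tensors_by_global_norm` code): a Python float used at capture time would be baked into the graph as a constant, so keeping the coefficient as a device tensor lets its value change between replays without recapturing.

```python
import torch

grads = [torch.randn(1024, device="cuda") for _ in range(4)]
clip_coef = torch.ones(1, device="cuda")  # fixed-address scalar tensor on the GPU

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for t in grads:
        t.mul_(clip_coef)  # reads clip_coef's current value at every replay

clip_coef.fill_(0.5)  # new value, same address
g.replay()            # scales all grads by 0.5
```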

For `total_norm = sum([t.data.float().norm(norm_type).item() ** norm_type for t in input_tensors])`, `item()` is a synchronous operation and is also not supported by graph capture. We directly put the `sum` and `** norm_type` on the GPU to execute the computation.
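A minimal sketch of that idea, keeping the reduction on the GPU so no host synchronization happens inside the capture (the helper name is made up):

```python
import torch

def total_norm_on_gpu(input_tensors, norm_type=2.0):
    # Sum of ||t||^p as a device tensor; any .item() call, if needed at all,
    # happens outside the captured region.
    return sum(t.data.float().norm(norm_type) ** norm_type for t in input_tensors)
```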

Other similar scenarios can also use this `graph_process()`, or a slightly modified version of `graph_process()`; the general capture-and-replay pattern is sketched below.
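An illustrative capture-then-replay helper (this is not the PR's `graph_process()`, only a sketch of the general pattern it is based on):

```python
import torch

def capture_and_replay_helper(func, *args):
    # Warm-up run on a side stream, as recommended for CUDA graph capture.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        func(*args)
    torch.cuda.current_stream().wait_stream(s)

    # Capture: args must keep their addresses for every later replay.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        func(*args)
    return graph  # subsequent steps call graph.replay()
```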

You can check out 4abab21 and set the flag to True here to do some benchmarking:
4abab21#diff-f8f0b3feb55b0374615405e542c1c3e0f017982b177c46c562bf688532ac935cR42

@tjruwase (Contributor) commented:

@inkcherry, can you please give more description of this PR?

@inkcherry (Contributor, Author) commented:

> @inkcherry, can you please give more description of this PR?

@tjruwase added : )

@inkcherry (Contributor, Author) commented Oct 7, 2023

Sorry for not replying in time due to regional holidays.
Q2: this may lead to a crash or error.

```python
import torch

b = None

def func(a):
    global b
    b = a
    for _ in range(10):
        b = b + 1

s = torch.cuda.Stream()
a = torch.full((1000,), 1, device="cuda")
static_mem = a.data_ptr()
with torch.cuda.stream(s):
    g = torch.cuda.CUDAGraph()
    torch.cuda.empty_cache()
    with torch.cuda.graph(g):
        func(a)
torch.cuda.current_stream().wait_stream(s)


# 1. This may lead to a crash because static_mem is freed, or, if another
#    variable reallocates static_mem, to incorrect behavior.
# a = None
# torch.cuda.empty_cache()
# # ...
# g.replay()
# print(b.sum().item())

# 2. This will not crash but could produce incorrect results due to the memory change.
# a = torch.full((1000,), 2, device="cuda")
# g.replay()
# print(b.sum().item())

# 3. This is correct; we need to keep the memory fixed.
# a.copy_(torch.full((1000,), 2, device="cuda"))
# g.replay()
# print(b.sum().item())
```

So if the address changes, it will lead to unexpected behavior.
Q1:
To verify: first, `func` is expected to use the same inputs logically, like a fixed part of the weights/gradients. Then, check whether the address of each input remains unchanged every time `func` is called without a graph. For example, create and fill a string variable for the i-th call of `_update_hp_grads_func` here: https://github.com/microsoft/DeepSpeed/pull/4318/files#diff-f8f0b3feb55b0374615405e542c1c3e0f017982b177c46c562bf688532ac935cR286

ith_call_update_hp_grads_mem_list += f"{hp_grad.data_ptr()},{lp.grad.data_ptr()}"

and check whether each `ith_call_update_hp_grads_mem_list` is the same on each rank. It would be even better if this is combined with the graph or kernel dump tool provided by the device.
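A small sketch of that check (all names here are hypothetical, not from the PR):

```python
call_ptrs = []  # one string of data_ptrs recorded per call to update_hp_grads

def record_call(hp_grads, lp_params):
    ptrs = ",".join(f"{hp.data_ptr()},{lp.grad.data_ptr()}"
                    for hp, lp in zip(hp_grads, lp_params))
    call_ptrs.append(ptrs)

# After several eager (non-graph) steps, every entry should be identical on
# each rank if the inputs really keep fixed addresses:
# assert len(set(call_ptrs)) == 1
```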
I hope this answer is helpful to you:)

```python
def create_graph(self):
    return torch.cuda.CUDAGraph()

def capture_to_graph(self, graph):
```
Contributor review comment:

Please change the interface to add parameters such as (graph, pool=None, stream=None) to align with https://pytorch.org/docs/master/generated/torch.cuda.graph.html#torch.cuda.graph
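For reference, one possible shape of such an interface (a sketch only; the class name is hypothetical and this is not the merged code):

```python
import torch

class GraphCaptureAccelerator:
    def create_graph(self):
        return torch.cuda.CUDAGraph()

    def capture_to_graph(self, graph, pool=None, stream=None):
        # Mirrors torch.cuda.graph(cuda_graph, pool=None, stream=None) so the
        # accelerator API stays aligned with the upstream context manager.
        return torch.cuda.graph(graph, pool=pool, stream=stream)
```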

Contributor review comment:

This API at https://pytorch.org/docs/master/generated/torch.cuda.CUDAGraph.html#torch.cuda.CUDAGraph.pool is also important; it is used to share the memory pool between graphs. We can add this API in this PR or in a future PR when it is really required.
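A minimal sketch of pool sharing between two CUDA graphs, using `CUDAGraph.pool()` as the handle passed to `torch.cuda.graph(pool=...)`:

```python
import torch

x = torch.zeros(1024, device="cuda")

g1 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    x.add_(1)

g2 = torch.cuda.CUDAGraph()
with torch.cuda.graph(g2, pool=g1.pool()):  # reuse g1's private memory pool
    x.mul_(2)

g1.replay()
g2.replay()
```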

@inkcherry (Contributor, Author) commented:

Thank you all for your reviews and suggestions.
I've made some changes; could you please take a look @tjruwase :)

@inkcherry (Contributor, Author) commented:

It seems that the failure is not caused by this modification, and the tests pass locally. Could you please retrigger the check? Thank you very much! @tjruwase
[screenshot of the unrelated CI failure]

@tjruwase (Contributor) commented:

@inkcherry, can you please check the formatting issue?

@inkcherry (Contributor, Author) commented:

> @inkcherry, can you please check the formatting issue?

Thanks for the reminder. I have now fixed the formatting. @tjruwase

@inkcherry (Contributor, Author) commented:

It seems that this CI workflow is a bit unlucky. Two of the commits passed the CI check, while the others seem to have encountered some failures that were not caused by this PR.
Sorry to bother you again; could you help me retry CI? Thank you very much for your time. @tjruwase

@tjruwase (Contributor) commented:

@inkcherry, thanks, it is no trouble at all. We appreciate your great contributions!

@inkcherry (Contributor, Author) commented:

@tjruwase The CI has all passed. Just a reminder in case you missed it.

@tjruwase tjruwase added this pull request to the merge queue Dec 20, 2023
Merged via the queue into microsoft:master with commit d5a7c1e Dec 20, 2023
15 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024