cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU #39437
Conversation
1. Added fused_gemm_epilogue op to leverage the cuBLASLt epilogue. 2. Support fusing Act(X*Y + bias), where X's dims >= 2 and Y's dims should be 2. 3. Act currently only supports ReLU (GeLU will be added in the future).
1. Added LinearAct into graph_pattern_detector.* to define the pattern from (2.). 2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)). 3. act currently only supports ReLU (GeLU will be supported in the future).
1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU} fusion (GeLU will be supported in the future). 2. Only supports matmul_v2 from nn.Linear.
1. Added GeLU support to fused_gemm_epilogue op. 2. Added EpilogueSingleton to cache auxiliary pointer. 3. Added related UTs.
1. Added support for fwd graphs with grad_ops linking to LinearAct. 2. Added related changes to fuse_gemm_epilogue_pass for the above modification.
1. Added matmul_v2 + ele_add pattern to LinearActPattern. 2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.
1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.
1. Added backward fusion pass for Linear(Act(x)). 2. Added backward fusion pass for Linear(x).
1. Made some function arguments pass by reference. 2. Removed redundant code. 3. Changed code to follow the Google code style.
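For context on what "leveraging the cuBLASLt epilogue" in the commits above means: the epilogue attribute lets a single cublasLtMatmul call apply the bias add and the activation itself, so no separate elementwise_add and relu kernels (and no extra global-memory round trips) are needed. Below is a minimal, hedged sketch of such a call; the float32 column-major setup, the omitted error handling, and the name FusedLinearRelu are assumptions made for illustration, not the Paddle kernel added by this PR.

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

// Sketch: D = ReLU(A * B + bias); A is m x k, B is k x n, bias has m elements.
// All pointers are device pointers; error checking omitted for brevity.
void FusedLinearRelu(cublasLtHandle_t handle, const float* A, const float* B,
                     const float* bias, float* D, int m, int n, int k,
                     cudaStream_t stream) {
  cublasLtMatmulDesc_t op_desc;
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // The epilogue attribute is where the fusion happens: RELU_BIAS makes the
  // GEMM itself apply the bias add and ReLU on the accumulator.
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
  cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &bias, sizeof(bias));

  cublasLtMatrixLayout_t a_desc, b_desc, d_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_32F, m, k, m);
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_32F, k, n, k);
  cublasLtMatrixLayoutCreate(&d_desc, CUDA_R_32F, m, n, m);

  const float alpha = 1.0f, beta = 0.0f;
  // C and D share the same layout and pointer here because beta == 0.
  cublasLtMatmul(handle, op_desc, &alpha, A, a_desc, B, b_desc, &beta, D,
                 d_desc, D, d_desc, /*algo=*/nullptr, /*workspace=*/nullptr,
                 /*workspaceSizeInBytes=*/0, stream);

  cublasLtMatrixLayoutDestroy(d_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op_desc);
}
```

The fused_gemm_epilogue op wraps a call of this kind, and the pass rewrites the matmul_v2 + elementwise_add + act subgraph produced by nn.Linear into it.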
Thanks for your contribution!
1. Modified the way to get the cublasLt handler in device_context to be consistent with the latest changes in develop.
Force-pushed from 7e616c8 to 2b03377
Force-pushed from 2b03377 to c2f5692
Force-pushed from c2f5692 to cb3bdae
Force-pushed from cb3bdae to 1b41b06
1. Require CUDA 11.6+. 2. Removed fuse_gemm_epilogue-related tests when CUDA < 11.6.
Force-pushed from 1b41b06 to fe8a560
}

ir::Graph *FuseGemmEpiloguePass::FuseLinearBwd(ir::Graph *graph,
                                               bool is_first_gemm) const {
What is is_first_gemm for? What is the difference between the first GEMM and the others? In the following code, I feel that is_first_gemm == true means the gradient of X is not needed?
Done, changed to with_x_gradient.
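For readers following along, here is a self-contained sketch of the semantics being discussed, i.e. which backward GEMM such a flag skips. All names (LinearBackward, Gemm) and the plain CPU implementation are invented for illustration; this is not the PR's code.

```cpp
#include <vector>

// Illustrative only: C[m x n] = op(A) * op(B) with row-major storage.
static void Gemm(bool trans_a, bool trans_b, int m, int n, int k,
                 const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>* c) {
  c->assign(static_cast<size_t>(m) * n, 0.f);
  for (int i = 0; i < m; ++i)
    for (int p = 0; p < k; ++p)
      for (int j = 0; j < n; ++j) {
        const float av = trans_a ? a[p * m + i] : a[i * k + p];
        const float bv = trans_b ? b[j * k + p] : b[p * n + j];
        (*c)[i * n + j] += av * bv;
      }
}

// Backward of Out[M x N] = X[M x K] * Y[K x N].
void LinearBackward(const std::vector<float>& dout, const std::vector<float>& x,
                    const std::vector<float>& y, int M, int N, int K,
                    bool without_x_gradient, std::vector<float>* dx,
                    std::vector<float>* dy) {
  // dY = X^T * dOut is always needed (it is the weight gradient).
  Gemm(/*trans_a=*/true, /*trans_b=*/false, K, N, M, x, dout, dy);
  if (!without_x_gradient) {
    // dX = dOut * Y^T can be skipped when X needs no gradient, e.g. for the
    // first Linear layer whose input is the network's data.
    Gemm(/*trans_a=*/false, /*trans_b=*/true, M, K, N, dout, y, dx);
  }
}
```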
  memory::allocation::AllocationPtr auxiliary = nullptr;
};

class EpilogueSingleton {
In my understanding, EpilogueSingleton is used to store a memory buffer which is written by the forward cublasLt API. This memory buffer must be passed to the backward cublasLt API without any modification. Therefore, you use a map here to save the name-to-memory-buffer mapping, where the name is the activation output name. Am I right?
I prefer using something like ReserveSpace in the batch_norm op. It is not encouraged to save the variable name inside the op attribute, which makes graph dependency analysis, etc. difficult.
Done
1. Changed the argument name is_first_gemm to without_x_gradient for clarity. 2. Applied PADDLE_THROW in fused_gemm_epilogue_op.
Sorry to inform you that fe8a560's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
1. Applied ReserveSpace to replace Epilogue for passing auxiliary pointers between FWD and BWD.
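This follows the usual cuBLASLt auxiliary-buffer pattern: the forward *_AUX_* epilogue writes an activation mask into a buffer that the graph now carries as a ReserveSpace variable, and the backward *_BGRAD epilogue reads that buffer back while also producing the bias gradient. Below is a hedged sketch of the relevant descriptor attributes, with illustrative helper names rather than the PR's code; the GELU path would use the analogous CUBLASLT_EPILOGUE_GELU_AUX_BIAS / CUBLASLT_EPILOGUE_DGELU_BGRAD enums and requires CUDA 11.6+.

```cpp
#include <cublasLt.h>

// Forward: D = ReLU(A * B + bias), additionally dumping the ReLU mask into
// reserve_space so the backward pass can apply ReLU'(.) exactly.
void SetForwardEpilogue(cublasLtMatmulDesc_t desc, const float* bias,
                        void* reserve_space, int64_t aux_ld) {
  cublasLtEpilogue_t epi = CUBLASLT_EPILOGUE_RELU_AUX_BIAS;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE, &epi,
                                 sizeof(epi));
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &bias, sizeof(bias));
  cublasLtMatmulDescSetAttribute(desc,
                                 CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER,
                                 &reserve_space, sizeof(reserve_space));
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD,
                                 &aux_ld, sizeof(aux_ld));
}

// Backward: fuses dReLU and the bias gradient into the GEMM, consuming the
// mask written by the forward call above.
void SetBackwardEpilogue(cublasLtMatmulDesc_t desc, float* dbias,
                         void* reserve_space, int64_t aux_ld) {
  cublasLtEpilogue_t epi = CUBLASLT_EPILOGUE_DRELU_BGRAD;
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE, &epi,
                                 sizeof(epi));
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &dbias, sizeof(dbias));
  cublasLtMatmulDescSetAttribute(desc,
                                 CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER,
                                 &reserve_space, sizeof(reserve_space));
  cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD,
                                 &aux_ld, sizeof(aux_ld));
}
```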
1. Added act op count checking in UTs. 2. Fixed an issue with fusing the backward of ReLU(Linear(X)). 3. TODO: solve GELU fusion issues.
1. Modified graph_pattern_detector to fit Linear with either GELU or ReLU. 2. Modified the data range in UTs to allow negative values.
… cublaslt_epilogue
  void Make() override {
    AddInput("X", "The input tensor X of Out = Act((X * Y) + bias).");
    AddInput("Y", "The input tensor Y of Out = Act((X * Y) + bias).");
    AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");
bias -> Bias?
Done
AddInput("Y", "The input tensor Y of Out = Act((X * Y) + bias)."); | ||
AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias)."); | ||
|
||
AddOutput("out", "The output tensor Out of Out = Act((X * Y) + bias)."); |
out -> Out?
Done
AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias)."); | ||
|
||
AddOutput("out", "The output tensor Out of Out = Act((X * Y) + bias)."); | ||
AddOutput("reserve_space", |
reserve_space -> ReserveSpace?
Done
"The input grad tensor to Out of Out = (Act(X) * Y) + bias"); | ||
AddInput("X", "The input tensor X of Out = (Act(X) * Y) + bias"); | ||
AddInput("Y", "The input tensor Y of Out = (Act(X) * Y) + bias"); | ||
AddInput("reserve_space", |
reserve_space -> ReserveSpace?
Done
  VarDesc reserve_space(patterns::PDNodeName(scope_name, "reserve_space"));
  auto *reserve_space_node = g->CreateVarNode(&reserve_space);

  EpiloguePassActivationCache::Instance().InsertFusedActivation(
How about making EpiloguePassActivationCache a local variable instead of a singleton? I mean, you could change the declarations of FuseGemmEpiloguePass::FuseLinearActFwd and FuseGemmEpiloguePass::FuseLinearActBwd to be:

ir::Graph *FuseGemmEpiloguePass::FuseLinearActFwd(
    ir::Graph *graph, const std::unordered_set<std::string> &act_types,
    bool is_training, bool is_act_grad_x_from_act,
    EpiloguePassActivationCache *cache) const;

ir::Graph *FuseGemmEpiloguePass::FuseLinearActBwd(
    ir::Graph *graph, const std::unordered_set<std::string> &act_grad_types,
    bool is_act_grad_x_from_act, const EpiloguePassActivationCache &cache) const;
Done.
Made EpiloguePassActivationCache a local variable and passed it to FuseLinearActFwd and FuseLinearActBwd.
Used a pointer rather than a reference, due to a request from the pre-commit hooks.
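A small self-contained illustration of the resulting shape, using stand-in types and names rather than the actual Paddle declarations: the cache now lives on the stack of the pass entry point and is handed to the forward and backward fusers explicitly, so no state can leak across pass invocations.

```cpp
#include <string>
#include <unordered_set>

// Stand-in for EpiloguePassActivationCache; the real class differs in detail.
struct ActivationCacheSketch {
  // Which activation outputs were fused by the forward rewrite, so the
  // backward rewrite can locate their reserve_space variables.
  std::unordered_set<std::string> fused_activation_outs;
};

void FuseLinearActFwdSketch(ActivationCacheSketch* cache) {
  cache->fused_activation_outs.insert("relu_0.tmp_0");  // forward writes
}

void FuseLinearActBwdSketch(ActivationCacheSketch* cache) {
  (void)cache->fused_activation_outs;                   // backward only reads
}

void ApplyPassSketch() {
  ActivationCacheSketch cache;  // local, per-invocation lifetime
  FuseLinearActFwdSketch(&cache);
  FuseLinearActBwdSketch(&cache);
}
```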
1. bias -> Bias. 2. out -> Out. 3. reserve_space -> ReserveSpace.
1. Removed the singleton in EpiloguePassActivationCache. 2. Made EpiloguePassActivationCache an argument to each pass function.
… cublaslt_epilogue
LGTM.
PR types
New features
PR changes
OPs
Describe
Added fused_gemm_epilogue_op to compute Matmul + ElementwiseAdd + ReLU|GeLU.
Added fused_gemm_epilogue_grad_op to compute ElementwiseAdd_grad + Matmul_grad + [ReLU|GeLU]_grad.
Added fuse_gemm_epilogue to BuildStrategy to enable fuse_gemm_epilogue_pass.