Enable OMP multithreading in lookup_table_v2 #45249
This reverts commit f4e4f8d.
Your PR was submitted successfully. Thank you for contributing to this open-source project!
@@ -48,6 +48,11 @@ struct EmbeddingCPUFunctor {
dev_ctx_.template Alloc<T>(out_);
auto* output = out_->data<T>();

#if defined _OPENMP
#ifndef PADDLE_WITH_CUDA
Maybe you could combine these conditions into one, like: #if defined(_OPENMP) && !defined(PADDLE_WITH_CUDA). That might be more readable.
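To illustrate the suggestion, here is a minimal standalone sketch comparing the nested guard from the diff with the combined form. The two helper functions are hypothetical and exist only for this comparison; they are not part of Paddle. Only the macro names `_OPENMP` and `PADDLE_WITH_CUDA` come from the diff.

```cpp
// Hypothetical helpers: each reports whether its guard would compile in
// the OpenMP code path under the current build flags.

// Nested form, as written in the diff.
bool omp_enabled_nested() {
#if defined _OPENMP
#ifndef PADDLE_WITH_CUDA
  return true;
#else
  return false;
#endif
#else
  return false;
#endif
}

// Combined form, as proposed in the review comment.
bool omp_enabled_combined() {
#if defined(_OPENMP) && !defined(PADDLE_WITH_CUDA)
  return true;
#else
  return false;
#endif
}
```

Both functions return the same value for any combination of the two macros, so the combined form changes readability only, not behavior.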
@jiangjiajun Please review
Your PR has been merged into the Paddle repository. Please keep an eye on the subsequent test results.
PR types
Performance optimization
PR changes
OPs
Describe
This PR enables OMP multithreading in the lookup_table_v2 operator, which reduces the operator's run time when running the ERNIE 3.0 model with more than 1 thread.
The speedup results on SPR with the FP32 datatype are as follows (total execution time as measured by the Paddle profiler; before this change, the times at every thread count were similar to the 1-thread time):
1 thread: 727.331
2 threads: 483.426
5 threads: 374.056
The change gives roughly a 2x speedup of the operator (727.331 / 374.056 ≈ 1.94x) when going from 1 thread to 5 threads.
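For readers unfamiliar with the technique, the following is a minimal sketch of an embedding lookup parallelized over the ids with OpenMP. It is not the actual EmbeddingCPUFunctor code from this PR: the function `embedding_lookup`, its parameters, and the flat row-major table layout are assumptions made for illustration. It also assumes every id is a valid row index; a real operator would need bounds checks and padding-index handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not Paddle's API): copies row ids[i] of a row-major
// table into row i of the output. Assumes all ids are in range.
template <typename T>
void embedding_lookup(const std::vector<T>& table, int64_t row_width,
                      const std::vector<int64_t>& ids, std::vector<T>* out) {
  const int64_t n = static_cast<int64_t>(ids.size());
  out->resize(static_cast<std::size_t>(n * row_width));
  const T* src = table.data();
  T* dst = out->data();

  // Each iteration writes a disjoint output row, so the loop can be
  // split across threads without synchronization.
#if defined(_OPENMP) && !defined(PADDLE_WITH_CUDA)
#pragma omp parallel for
#endif
  for (int64_t i = 0; i < n; ++i) {
    std::memcpy(dst + i * row_width, src + ids[i] * row_width,
                sizeof(T) * static_cast<std::size_t>(row_width));
  }
}
```

When compiled without OpenMP support (for example, without `-fopenmp` on GCC or Clang), the pragma is skipped and the loop runs serially with the same result.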