fix reduce_any kernel data race on sharedMem #47233
Conversation
Your PR has been submitted successfully. Thank you for your contribution to this open-source project!
```diff
- int tid = threadIdx.y * blockDim.x + threadIdx.x;
- int wid = tid / kWarpSize;
+ int lane, tid, wid, n;
+ if (kWarpSize == 32 || kWarpSize == 64) {
```
Doesn't kWarpSize always satisfy this condition? Is this check still needed?
I kept the original computation as a fallback branch in case the warp size changes in future NVIDIA architectures; it will be removed later.
done
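(An aside for illustration: if the only goal of the branch was to guard against a future warp size other than 32 or 64, a common alternative is a compile-time assertion rather than a runtime check. This is a hedged sketch of that pattern, not what the PR actually does; it relies on `kWarpSize` being a compile-time constant, which the review below confirms.)

```cpp
// Sketch only: fail the build, rather than branching at runtime, if a
// future architecture ever uses a warp size other than 32 or 64.
static_assert(kWarpSize == 32 || kWarpSize == 64,
              "shift/mask indexing assumes a 32- or 64-lane warp");
```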
…pd/Paddle into fix_reduce_any_data_race
LGTM for CI-OP-Benchmark
```diff
+ int lane, tid, wid, bid, n;
+ // Bit operation can be used when kWarpSize is 32 or 64 now
+ n = kWarpSize == 32 ? 5 : 6;
+ block_dim_x = blockDim.x >> n;
```
Two further improvements are possible for `n = kWarpSize == 32 ? 5 : 6;` here:
- `n` is a right-shift amount, so it could be renamed to `int rshift_val`.
- The value can be determined at compile time, so it could be written as
  `constexpr int rshift_val = (kWarpSize != 32) ? ((kWarpSize == 64) ? 6 : 5) : 5;` (see the sketch below).
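For illustration, a minimal sketch of the suggested pattern, assuming `kWarpSize` is a compile-time constant equal to 32 or 64 (the helper name `WarpIndices` is hypothetical, not part of the PR):

```cpp
// Sketch only: with a power-of-two kWarpSize fixed at compile time, the
// shift amount is a constexpr and division/modulo become shift/mask.
template <int kWarpSize>
__device__ __forceinline__ void WarpIndices(int tid, int* wid, int* lane) {
  constexpr int rshift_val = (kWarpSize == 64) ? 6 : 5;  // log2(kWarpSize)
  *wid = tid >> rshift_val;       // equivalent to tid / kWarpSize
  *lane = tid & (kWarpSize - 1);  // equivalent to tid % kWarpSize
}
```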
done
```diff
- int tid = threadIdx.y * blockDim.x + threadIdx.x;
- int wid = tid / kWarpSize;
- int bid = threadIdx.y;
+ int lane, tid, wid, bid, n;
```
A small code-style suggestion: separating declarations from assignments, and leaving variables uninitialized at declaration, hurts maintainability and debugging. Please revert to the original style of initializing each variable where it is declared (see the sketch below).
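For illustration, the two styles being contrasted (a sketch with illustrative variables inside a kernel body, not the exact PR code):

```cpp
{  // Discouraged here: declarations separated from assignments, uninitialized.
  int tid, wid;
  tid = threadIdx.y * blockDim.x + threadIdx.x;
  wid = tid / kWarpSize;
}
{  // Suggested: initialize each variable at the point of declaration.
  int tid = threadIdx.y * blockDim.x + threadIdx.x;
  int wid = tid / kWarpSize;
}
```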
The declarations and assignments were separated earlier because of variable lifetimes across the if branches; now that the branch has been removed, I have changed it as suggested. done
LGTM for CI-OP-Benchmark
@niuliling123 Please review the changes to Kps.
LGTM
This can be merged once the remaining reviewers all approve.
LGTM
Verified, no more errors reported from …
PR types
Bug fixes
PR changes
OPs
Describe
Analyzed and fixed the shared-memory data race reported in issue: Potential race-condition in phi::funcs::ReduceAnyKernel #46974. `val = shared[bid * block_dim_x + lane];` reads data from shared memory; the subsequent `CudaShuffleDownSync` synchronizes threads within a warp, but threads in different warps remain unsynchronized, so the following write `shared[threadIdx.y] = val;` races with those earlier shared-memory reads.
The kps change touches many files and the CI job timed out; benchmark performance data measured locally is shown in the figure below (some GPUs were intermittently occupied by other jobs, so part of the data is inaccurate).
Replacing division and modulo operations with bit operations brings a small performance improvement.
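For context, a minimal sketch of the race described above and of the fix, a block-level barrier between the cross-warp read and the write that reuses the buffer. This is a hypothetical simplification, not the actual Paddle kernel: the kernel name `RowAnyReduce` is illustrative, the names `shared`, `block_dim_x`, `bid`, `wid`, and `lane` follow the snippets quoted in this thread, and it assumes `blockDim.x` is a multiple of 32.

```cpp
#include <cuda_runtime.h>

// Sketch only: a 2D block reduces each row with warp shuffles, stages
// per-warp partials in one shared buffer, and then reuses that same buffer
// for the per-row results. __shfl_down_sync (wrapped by CudaShuffleDownSync
// in the real kernel) synchronizes lanes within ONE warp only, so a
// block-wide __syncthreads() is required before the buffer is reused.
__global__ void RowAnyReduce(const int* in, int* out) {
  __shared__ int shared[64];                 // staging buffer, reused below
  const int kWarpSize = 32;
  int block_dim_x = blockDim.x / kWarpSize;  // warps per row
  int tid = threadIdx.y * blockDim.x + threadIdx.x;
  int wid = tid / kWarpSize;                 // warp id within the block
  int lane = tid % kWarpSize;
  int bid = threadIdx.y;                     // row id

  int val = in[blockIdx.x * blockDim.y * blockDim.x + tid];
  for (int offset = kWarpSize / 2; offset > 0; offset >>= 1)
    val |= __shfl_down_sync(0xffffffffu, val, offset);  // warp-level "any"
  if (lane == 0) shared[wid] = val;          // one partial per warp
  __syncthreads();                           // partials visible block-wide

  if (threadIdx.x < kWarpSize) {             // first warp of each row
    val = (threadIdx.x < block_dim_x)
              ? shared[bid * block_dim_x + threadIdx.x]  // cross-warp read
              : 0;
    for (int offset = kWarpSize / 2; offset > 0; offset >>= 1)
      val |= __shfl_down_sync(0xffffffffu, val, offset);
  }
  __syncthreads();  // THE FIX: without this barrier, the write below can
                    // overwrite slots that other rows are still reading
                    // in the statement above -- the data race in #46974.
  if (threadIdx.x == 0) shared[threadIdx.y] = val;  // buffer reuse
  __syncthreads();
  if (tid == 0) {
    int result = 0;
    for (unsigned y = 0; y < blockDim.y; ++y) result |= shared[y];
    out[blockIdx.x] = result;
  }
}
```

Without the marked barrier, for example, row 1's thread with `threadIdx.x == 0` can write `shared[1]` while row 0's threads are still reading `shared[0 * block_dim_x + 1]`, which is the same slot.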