index_select: Optimizing the kernel with reducing for-loops in TensorInfo OffsetCalculator (#924)

Two reasons for the slow perf in index_select:
1. We used static loops with a fixed count of 12.
2. We used int64_t for the offset index. PVC doesn't have long datatype instructions, so a single offset calculation takes about 30us.

So we have the following optimizations in this PR:
1. Aligned with CUDA by using a dynamic loop boundary.
2. Optimized the offset calculator (#816).

We got a 2x perf improvement in index_select.

------------------  ----------  --------  -----------  ---------  ------------  --------  ----------  ---------  ------------  ----------
Name                Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self XPU  Self XPU %  XPU total  XPU time avg  # of Calls
------------------  ----------  --------  -----------  ---------  ------------  --------  ----------  ---------  ------------  ----------
aten::index_select      17.34%   2.161ms       41.05%    5.115ms      85.257us  12.734ms     100.00%   12.734ms     212.237us          60

---------

Signed-off-by: majing <Jing1.Ma@intel.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
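
The two changes described in the message can be illustrated with a small host-side sketch. This is not the code from the PR: `TensorDesc`, `kMaxDims`, `offset_static`, and `offset_dynamic` are hypothetical names, and the value 12 for `kMaxDims` is an assumption taken from the "static loops" remark. The sketch only contrasts a fixed-bound loop using 64-bit arithmetic with a loop bounded by the actual number of dimensions using 32-bit arithmetic, which is valid only when all sizes and offsets fit in 32 bits.

```cpp
#include <cstdint>

// Hypothetical stand-in for a tensor descriptor; not the real TensorInfo.
constexpr int kMaxDims = 12;  // assumed maximum rank

struct TensorDesc {
  int dims;                    // number of dimensions actually in use
  uint32_t sizes[kMaxDims];    // extent of each dimension
  uint32_t strides[kMaxDims];  // stride (in elements) of each dimension
};

// Before (sketch): the loop always runs kMaxDims times and uses 64-bit
// integer division and modulo, even for low-rank tensors.
int64_t offset_static(const TensorDesc& t, int64_t linear) {
  int64_t offset = 0;
  for (int i = kMaxDims - 1; i >= 0; --i) {
    // Unused slots behave like a dimension of size 1 with stride 0.
    int64_t size = (i < t.dims) ? static_cast<int64_t>(t.sizes[i]) : 1;
    int64_t stride = (i < t.dims) ? static_cast<int64_t>(t.strides[i]) : 0;
    offset += (linear % size) * stride;
    linear /= size;
  }
  return offset;
}

// After (sketch): the loop bound is the real rank, and all arithmetic is
// 32-bit. Assumes the tensor is small enough for 32-bit indexing.
uint32_t offset_dynamic(const TensorDesc& t, uint32_t linear) {
  uint32_t offset = 0;
  for (int i = t.dims - 1; i >= 0; --i) {
    offset += (linear % t.sizes[i]) * t.strides[i];
    linear /= t.sizes[i];
  }
  return offset;
}
```

For a rank-3 tensor, `offset_dynamic` performs 3 divide/modulo pairs instead of 12, and both functions return the same offset as long as the values fit in 32 bits.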