Skip to content

Commit

Permalink
index_select: Optimizing the kernel with reducing for-loops in Tensor…
Browse files Browse the repository at this point in the history
…Info OffsetCalculator (#924)

Two reasons for the slow perf in index_select

1. We used static loops times 12
2. We used int64_t for offset index, PVC doesn't have long datatype
instruction, so it takes about 30us for once offset calculation.

So we have following optimization in this pr:
1, aligned CUDA, using dynamic loop boundry
2, optimized offset calculator

#816

We got 2x perf improvement in index_select
------------------------------------------------------- ------------
------------ ------------ ------------ ------------ ------------
------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self XPU
Self XPU % XPU total XPU time avg # of Calls

aten::index_select 17.34% 2.161ms 41.05% 5.115ms 85.257us 12.734ms
100.00% 12.734ms 212.237us 60

---------

Signed-off-by: majing <Jing1.Ma@intel.com>
Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
  • Loading branch information
majing921201 and fengyuan14 authored Sep 26, 2024
1 parent 0ab67fb commit d9ae62d
Show file tree
Hide file tree
Showing 2 changed files with 9 additions and 9 deletions.
15 changes: 6 additions & 9 deletions src/comm/TensorInfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -137,16 +137,13 @@ struct IndexToOffset {
return linearId;
}

#pragma unroll
for (int dim = XPU_MAX_TENSORINFO_DIMS - 1; dim >= 0; --dim) {
if (dim < info.dims) {
auto divider = at::detail::IntDivider<IndexType>(info.sizes[dim]);
auto divmod = divider.divmod(linearId);
linearId = divmod.div;
offset += divmod.mod * info.strides[dim];
}
for (int dim = info.dims - 1; dim > 0; --dim) {
IndexType curDimIndex = linearId % info.sizes[dim];
IndexType curDimOffset = curDimIndex * info.strides[dim];
offset += curDimOffset;
linearId /= info.sizes[dim];
}
return offset;
return offset + linearId * info.strides[0];
}
};

Expand Down
3 changes: 3 additions & 0 deletions test/xpu/skip_list_common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1154,6 +1154,9 @@
"test_sequential_pin_memory",
"test_shuffle_pin_memory",
"test_pin_memory",
# failed in preci
# https://github.com/intel/torch-xpu-ops/issues/928
"test_segfault",
),

"test_tensor_creation_ops_xpu.py": (
Expand Down

0 comments on commit d9ae62d

Please sign in to comment.