
Proposal: Switch to safer data_ptr API #654

Closed
Stonepia opened this issue Jul 26, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@Stonepia
Contributor

Stonepia commented Jul 26, 2024

🐛 Describe the bug

TL;DR

  1. We should use the safer data_ptr accessing APIs. The newer template APIs perform additional checks. If possible, use tensor.mutable_data_ptr<T>() and tensor.const_data_ptr<T>(). Stop using (scalar_t*)tensor.data_ptr().
  2. Stop using char*. Either use int* or use the allocator directly for buffer access.

1. Use template data_ptr API for tensor

1.1 Existing code usage and its problem

Existing code uses raw pointers like this:

// Non-template usage. Avoid this.
kernel_call(
    (scalar_t*)tensor.data_ptr());

The above usage does not check whether the tensor is initialized. When the tensor's storage is not initialized (for example, a FakeTensor), data_ptr() is null, so the kernel may write to uninitialized memory and cause a page fault.
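A minimal sketch of this failure mode (the kernel wrapper and variable names are illustrative, not from the codebase):

#include <ATen/ATen.h>

// Hypothetical kernel wrapper that writes through the pointer it receives.
void launch_fill_kernel(float* out, int64_t n);

void bad_path(const at::Tensor& t) {
  // Unchecked raw cast: nothing verifies that t's storage is materialized.
  // For a tensor whose storage is not initialized (e.g. a FakeTensor),
  // this pointer can be null, and the kernel then faults on the first write.
  float* p = (float*)t.data_ptr();
  launch_fill_kernel(p, t.numel());
}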

1.2 template data_ptr

The template data_ptr() performs additional checks, such as has_storage() and storage_initialized(). Please refer to PyTorch's TensorImpl.h for details.
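
Roughly speaking, the checked path amounts to something like the following paraphrase (not the actual TensorImpl.h source, just the shape of the checks):

#include <ATen/ATen.h>

// Paraphrased sketch of the checks the template accessor performs before
// handing out a typed pointer; see TensorImpl.h for the real implementation.
template <typename T>
T* checked_data_ptr(const at::Tensor& t) {
  TORCH_CHECK(t.has_storage(),
              "Cannot access data pointer of Tensor that doesn't have storage");
  TORCH_CHECK(t.unsafeGetTensorImpl()->storage_initialized(),
              "Cannot access data pointer of Tensor with uninitialized storage");
  return t.data_ptr<T>();  // the template API also validates T against the dtype
}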

In view of that, old call sites should be migrated to the safer template usage:

// Template usage. Safer than the raw cast, but prefer the APIs in 1.3.
input_.data_ptr<scalar_t>();

1.3 New API: mutable_data_ptr() and const_data_ptr()

PyTorch has introduced new APIs to replace tensor.data_ptr(). Their usage is the same as the template data_ptr(), and their meaning is just as the names suggest.

// Use these
input_.const_data_ptr<scalar_t>();
input_.mutable_data_ptr<scalar_t>();
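
For instance, a call site that reads one tensor and writes another would pick the accessor by intent (a sketch; the kernel name is illustrative):

#include <ATen/ATen.h>

// Hypothetical kernel wrapper: reads `in`, writes `out`.
template <typename scalar_t>
void my_kernel(const scalar_t* in, scalar_t* out, int64_t n);

template <typename scalar_t>
void call_site(const at::Tensor& input_, at::Tensor& output_) {
  my_kernel<scalar_t>(
      input_.const_data_ptr<scalar_t>(),    // read-only input
      output_.mutable_data_ptr<scalar_t>(), // written output
      input_.numel());
}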

2. Avoid using char * data ptr for buffer

The existing code has the following pattern. For example, the code snippet in torch-xpu-ops/...Reduce.h:

Tensor buffer;
Tensor semaphores;

buffer = at::empty(
    config.global_memory_size(),
    at::TensorOptions().dtype(kChar).device(kXPU));
semaphores = at::empty(
    config.semaphore_size(),
    at::TensorOptions().dtype(kChar).device(kXPU));

at::detail::Array<char*, 1> data;
data[0] = (char*)semaphores.data_ptr();
buf_ptr = buffer.defined() ? (void*)buffer.data_ptr() : nullptr;

The template tensor.data_ptr<T>() does not support the char type, so we should avoid this pattern. We could align with PyTorch's usage in one of the following ways:

2.1 Use int* instead of char*

For example, PyTorch has the following code in Normalization.cuh:

at::Tensor semaphores;
semaphores = at::zeros({grid.x}, input.options().dtype(at::kInt));

int* semaphores_ptr = grid.y > 1 ? semaphores.mutable_data_ptr<int>() : nullptr;
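
Applied to the buffer code from section 2, that pattern might look roughly like this (a sketch only; it assumes config.semaphore_size() is a byte count divisible by sizeof(int), and reuses the kXPU device option from the original snippet):

// Sketch: allocate the semaphore buffer as kInt instead of kChar so the
// checked template accessor can be used. at::zeros also provides the
// zero-initialization the semaphores need.
at::Tensor semaphores = at::zeros(
    {static_cast<int64_t>(config.semaphore_size() / sizeof(int))},
    at::TensorOptions().dtype(at::kInt).device(kXPU));

int* semaphores_ptr = semaphores.mutable_data_ptr<int>();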

However, the above usage is still not best practice: it constructs a tensor just to use the data inside it. We could use an allocator directly instead; see section 2.2.

2.2 Use the allocator directly.

For example, consider the code snippet in Reduce.cuh:

at::DataPtr buffer;
at::DataPtr semaphores;

if (config.should_global_reduce()) {
  auto& allocator = *c10::cuda::CUDACachingAllocator::get();
  buffer = allocator.allocate(config.global_memory_size());
  semaphores = allocator.allocate(config.semaphore_size());
  // ...
}

auto reduce_kernel = ReduceOp<scalar_t, ops_t, uint32_t, out_scalar_t, vt0>(
    buffer.get(),
    (int*)semaphores.get(),
    /* ... other constructor arguments ... */);

It uses the allocator directly rather than constructing a tensor. I believe this usage could yield some performance gain.
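
For torch-xpu-ops, the analogous change would presumably route through the XPU caching allocator. A sketch, assuming an accessor along the lines of c10::xpu::XPUCachingAllocator::get() (the exact accessor should be verified against the codebase):

// Sketch: take raw device allocations from the caching allocator instead of
// constructing kChar tensors. The allocator accessor name is an assumption.
at::DataPtr buffer;
at::DataPtr semaphores;

if (config.should_global_reduce()) {
  auto& allocator = *c10::xpu::XPUCachingAllocator::get();  // assumed accessor
  buffer = allocator.allocate(config.global_memory_size());
  semaphores = allocator.allocate(config.semaphore_size());
  // Note: the semaphore memory still needs to be zero-initialized before use.
}

auto reduce_kernel = ReduceOp<scalar_t, ops_t, uint32_t, out_scalar_t, vt0>(
    buffer.get(),
    (int*)semaphores.get(),
    /* ... other constructor arguments as in the existing kernel ... */);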

@Stonepia
Contributor Author

For the existing code, I changed the first part in PR #655.
The second part is not solved yet.

@Stonepia Stonepia added the enhancement New feature or request label Jul 26, 2024
@intel intel locked and limited conversation to collaborators Jul 30, 2024
@Stonepia Stonepia converted this issue into discussion #660 Jul 30, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
