Changes to reduce kernel launch overheads #794

linehill · 2024-03-05T10:22:20Z

A patch set for reducing kernel launch overheads. These improved HeCBench's mrc, floydwarshall and overlay benchmarks by 41-57% on PVC.

Skip kernel info string construction when it is not going to be logged/displayed.
Reduce kernel launch overhead from __hipPushCallConfiguration() and __hipPopCallConfiguration() functions:
- Remove unnecessary backend initialization (backend will be initialized at hipLaunchKernel()).
- Avoid creating short living temporary, heap allocated chipStar::ExecItem.
Exit earlier from prepareDeviceVariablesNoLock().
Recycle OpenCL kernel handles instead of creating them for each exec-item.

pvelesko · 2024-03-13T00:16:43Z

Unit test failures for iGPU OpenCL

igpu_opencl_make_check_result.txt: FAIL
	377 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
	378 - ABM_AddKernel_MultiTypeMultiSize - long (SEGFAULT)
	380 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)

among others

linehill · 2024-03-15T09:27:07Z

igpu_opencl_make_check_result.txt: FAIL
	377 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
	378 - ABM_AddKernel_MultiTypeMultiSize - long (SEGFAULT)
	380 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)

What’s going on with those tests? They should be failing expectedly.

2024-03-08T03:28:15.1003549Z Name:                         	Intel(R) UHD Graphics 730
(...)
2024-03-08T03:28:15.1008640Z maxThreadsPerBlock:           	512
2024-03-08T03:28:15.1008861Z maxThreadsDim.x:              	512

Is this the iGPU on which the tests fail? The tests should be failing as they try to launch kernels over the supported thread block size (1000 vs. 512):

TEMPLATE_TEST_CASE("ABM_AddKernel_MultiTypeMultiSize", "", int, long, float, long long, double) {
  auto size = GENERATE(as<size_t>{}, 100, 500, 1000);
  // (...)
  hipLaunchKernelGGL(add<TestType>, 1, size, 0, 0, d_a, d_b, d_c, size);
  HIP_CHECK(hipGetLastError());

I see the tests failing on the main branch on my iGPU where thread block sizes limited to 256. If these tests are passing on the main branch on the CI’s iGPU - why are they passing in the first place?

pvelesko · 2024-03-16T03:02:38Z

@linehill tests are limited to 500 max so they don't fail on main for me not should they be failing. The test in question:

TEMPLATE_TEST_CASE("ABM_AddKernel_MultiTypeMultiSize", "", int, long, float, long long, double) {
  auto size = GENERATE(as<size_t>{}, 100, 500);
  TestType *d_a, *d_b, *d_c;
  auto res = hipMalloc(&d_a, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);
  res = hipMalloc(&d_b, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);
  res = hipMalloc(&d_c, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);

  std::vector<TestType> a, b, c;
  for (size_t i = 0; i < size; i++) {
    a.push_back(i + 1);
    b.push_back(i + 1);
    c.push_back(2 * (i + 1));
  }

  res = hipMemcpy(d_a, a.data(), sizeof(TestType) * size, hipMemcpyHostToDevice);
  REQUIRE(res == hipSuccess);
  res = hipMemcpy(d_b, b.data(), sizeof(TestType) * size, hipMemcpyHostToDevice);
  REQUIRE(res == hipSuccess);

  hipLaunchKernelGGL(add<TestType>, 1, size, 0, 0, d_a, d_b, d_c, size);
  HIP_CHECK(hipGetLastError());

  res = hipMemcpy(a.data(), d_c, sizeof(TestType) * size, hipMemcpyDeviceToHost);
  REQUIRE(res == hipSuccess);

  HIP_CHECK(hipFree(d_a));
  HIP_CHECK(hipFree(d_b));
  HIP_CHECK(hipFree(d_c));
  REQUIRE(a == c);
}

You probably looked in hip-tests which is not enabled by default.

pvelesko · 2024-03-16T03:03:44Z

iGPU where thread block sizes limited to 256

you just pasted a snippet where it says 512, not 256.

pvelesko · 2024-03-16T07:05:34Z

Is this the iGPU on which the tests fail?

yes, all tests that fail, fail on the iGPU OpenCL backend.

linehill · 2024-03-18T07:15:40Z

iGPU where thread block sizes limited to 256

you just pasted a snippet where it says 512, not 256.

512 is figure from CI's test log.

You probably looked in hip-tests which is not enabled by default.

I see, I thought chipStar switched to hip-tests.

linehill · 2024-03-18T11:41:13Z

No new changes, just rebase.

Skip kernel info string construction when it is not going to be logged/displayed.

Reduce kernel launch overhead from __hipPushCallConfiguration() and __hipPopCallConfiguration() functions: * Remove unnecessary backend initialization (backend will be initialized at hipLaunchKernel()). * Avoid creating short living temporary, heap allocated chipStar::ExecItem.

Exit earlier from prepareDeviceVariablesNoLock().

Recycle OpenCL kernel handles instead of creating them for each exec-item.

pvelesko

Not sure why linter complains for this PR

src/backend/OpenCL/CHIPBackendOpenCL.hh:42:10 [clang-diagnostic-error]
'CL/cl_ext.h' file not found

but other than that, great PR.

LGTM

pvelesko force-pushed the launch-overhead branch from 0e0d58d to 129dc77 Compare March 7, 2024 23:57

linehill force-pushed the launch-overhead branch from 129dc77 to 29e59c2 Compare March 15, 2024 09:27

linehill force-pushed the launch-overhead branch from 29e59c2 to 84c81c1 Compare March 18, 2024 11:40

linehill added 4 commits March 19, 2024 10:45

Redure kernel launch overhead - part 1

700afb8

Skip kernel info string construction when it is not going to be logged/displayed.

Reduce kernel launch overhead - part 3

f936393

Exit earlier from prepareDeviceVariablesNoLock().

Reduce kernel launch overhead - part 4

0ff1aff

Recycle OpenCL kernel handles instead of creating them for each exec-item.

pvelesko force-pushed the launch-overhead branch from 84c81c1 to 0ff1aff Compare March 19, 2024 01:45

pvelesko approved these changes Mar 19, 2024

View reviewed changes

pvelesko merged commit eeb1b44 into main Mar 19, 2024
28 of 29 checks passed

pvelesko deleted the launch-overhead branch March 19, 2024 06:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changes to reduce kernel launch overheads #794

Changes to reduce kernel launch overheads #794

linehill commented Mar 5, 2024

pvelesko commented Mar 13, 2024

linehill commented Mar 15, 2024

pvelesko commented Mar 16, 2024

pvelesko commented Mar 16, 2024

pvelesko commented Mar 16, 2024

linehill commented Mar 18, 2024

linehill commented Mar 18, 2024

pvelesko left a comment

Changes to reduce kernel launch overheads #794

Changes to reduce kernel launch overheads #794

Conversation

linehill commented Mar 5, 2024

pvelesko commented Mar 13, 2024

linehill commented Mar 15, 2024

pvelesko commented Mar 16, 2024

pvelesko commented Mar 16, 2024

pvelesko commented Mar 16, 2024

linehill commented Mar 18, 2024

linehill commented Mar 18, 2024

pvelesko left a comment

Choose a reason for hiding this comment