Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to reduce kernel launch overheads #794

Merged
merged 4 commits into from
Mar 19, 2024
Merged

Changes to reduce kernel launch overheads #794

merged 4 commits into from
Mar 19, 2024

Conversation

linehill
Copy link
Collaborator

@linehill linehill commented Mar 5, 2024

A patch set for reducing kernel launch overheads. These improved HeCBench's mrc, floydwarshall and overlay benchmarks by 41-57% on PVC.

  • Skip kernel info string construction when it is not going to be logged/displayed.
  • Reduce kernel launch overhead from __hipPushCallConfiguration() and __hipPopCallConfiguration() functions:
    • Remove unnecessary backend initialization (backend will be initialized at hipLaunchKernel()).
    • Avoid creating short living temporary, heap allocated chipStar::ExecItem.
  • Exit earlier from prepareDeviceVariablesNoLock().
  • Recycle OpenCL kernel handles instead of creating them for each exec-item.

@pvelesko
Copy link
Collaborator

Unit test failures for iGPU OpenCL

igpu_opencl_make_check_result.txt: FAIL
	377 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
	378 - ABM_AddKernel_MultiTypeMultiSize - long (SEGFAULT)
	380 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)

among others

@linehill
Copy link
Collaborator Author

igpu_opencl_make_check_result.txt: FAIL
	377 - ABM_AddKernel_MultiTypeMultiSize - int (Failed)
	378 - ABM_AddKernel_MultiTypeMultiSize - long (SEGFAULT)
	380 - ABM_AddKernel_MultiTypeMultiSize - long long (Failed)

What’s going on with those tests? They should be failing expectedly.

2024-03-08T03:28:15.1003549Z Name:                         	Intel(R) UHD Graphics 730
(...)
2024-03-08T03:28:15.1008640Z maxThreadsPerBlock:           	512
2024-03-08T03:28:15.1008861Z maxThreadsDim.x:              	512

Is this the iGPU on which the tests fail? The tests should be failing as they try to launch kernels over the supported thread block size (1000 vs. 512):

TEMPLATE_TEST_CASE("ABM_AddKernel_MultiTypeMultiSize", "", int, long, float, long long, double) {
  auto size = GENERATE(as<size_t>{}, 100, 500, 1000);
  // (...)
  hipLaunchKernelGGL(add<TestType>, 1, size, 0, 0, d_a, d_b, d_c, size);
  HIP_CHECK(hipGetLastError());

I see the tests failing on the main branch on my iGPU where thread block sizes limited to 256. If these tests are passing on the main branch on the CI’s iGPU - why are they passing in the first place?

@pvelesko
Copy link
Collaborator

@linehill tests are limited to 500 max so they don't fail on main for me not should they be failing. The test in question:

TEMPLATE_TEST_CASE("ABM_AddKernel_MultiTypeMultiSize", "", int, long, float, long long, double) {
  auto size = GENERATE(as<size_t>{}, 100, 500);
  TestType *d_a, *d_b, *d_c;
  auto res = hipMalloc(&d_a, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);
  res = hipMalloc(&d_b, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);
  res = hipMalloc(&d_c, sizeof(TestType) * size);
  REQUIRE(res == hipSuccess);

  std::vector<TestType> a, b, c;
  for (size_t i = 0; i < size; i++) {
    a.push_back(i + 1);
    b.push_back(i + 1);
    c.push_back(2 * (i + 1));
  }

  res = hipMemcpy(d_a, a.data(), sizeof(TestType) * size, hipMemcpyHostToDevice);
  REQUIRE(res == hipSuccess);
  res = hipMemcpy(d_b, b.data(), sizeof(TestType) * size, hipMemcpyHostToDevice);
  REQUIRE(res == hipSuccess);

  hipLaunchKernelGGL(add<TestType>, 1, size, 0, 0, d_a, d_b, d_c, size);
  HIP_CHECK(hipGetLastError());

  res = hipMemcpy(a.data(), d_c, sizeof(TestType) * size, hipMemcpyDeviceToHost);
  REQUIRE(res == hipSuccess);

  HIP_CHECK(hipFree(d_a));
  HIP_CHECK(hipFree(d_b));
  HIP_CHECK(hipFree(d_c));
  REQUIRE(a == c);
}

You probably looked in hip-tests which is not enabled by default.

@pvelesko
Copy link
Collaborator

iGPU where thread block sizes limited to 256

you just pasted a snippet where it says 512, not 256.

@pvelesko
Copy link
Collaborator

Is this the iGPU on which the tests fail?

yes, all tests that fail, fail on the iGPU OpenCL backend.

@linehill
Copy link
Collaborator Author

iGPU where thread block sizes limited to 256

you just pasted a snippet where it says 512, not 256.

512 is figure from CI's test log.

You probably looked in hip-tests which is not enabled by default.

I see, I thought chipStar switched to hip-tests.

@linehill
Copy link
Collaborator Author

No new changes, just rebase.

Skip kernel info string construction when it is not going to be
logged/displayed.
Reduce kernel launch overhead from __hipPushCallConfiguration() and
__hipPopCallConfiguration() functions:

* Remove unnecessary backend initialization (backend will be
  initialized at hipLaunchKernel()).

* Avoid creating short living temporary, heap allocated
  chipStar::ExecItem.
Exit earlier from prepareDeviceVariablesNoLock().
Recycle OpenCL kernel handles instead of creating them for each exec-item.
Copy link
Collaborator

@pvelesko pvelesko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why linter complains for this PR

src/backend/OpenCL/CHIPBackendOpenCL.hh:42:10 [clang-diagnostic-error]
'CL/cl_ext.h' file not found

but other than that, great PR.

LGTM

@pvelesko pvelesko merged commit eeb1b44 into main Mar 19, 2024
28 of 29 checks passed
@pvelesko pvelesko deleted the launch-overhead branch March 19, 2024 06:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants