[REVIEW] Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events #425

Merged
merged 33 commits into rapidsai:branch-0.15
Jul 10, 2020

Conversation

harrism
Member

@harrism harrism commented Jun 24, 2020

This closes #410, but in a different (better) way than #416. This PR uses thread-local storage to create a unique CUDA event for each thread's default stream in the PTDS case. It also attempts to unify the code paths for PTDS and non-PTDS as much as possible; however, the overhead of calling cudaEventRecord on every deallocation proves expensive, so it is currently disabled in the non-PTDS case. (#416 instead creates an event for every freed block and maintains lists of events that must be synchronized when memory is allocated from a coalesced block. These lists proved to have high overhead in both performance and code complexity.)
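
For illustration, a minimal sketch of the thread-local event idea (not the exact code in this PR; the default_stream_event name is made up here):

// Hypothetical sketch: one CUDA event per thread, created lazily on first use
// and destroyed automatically when the thread exits.
#include <cuda_runtime_api.h>

cudaEvent_t default_stream_event()
{
  // Small RAII holder so the event is cleaned up at thread exit.
  struct event_holder {
    cudaEvent_t event{};
    event_holder() { cudaEventCreateWithFlags(&event, cudaEventDisableTiming); }
    ~event_holder() { cudaEventDestroy(event); }
  };
  // thread_local: each thread gets its own event for its per-thread default stream.
  thread_local event_holder holder;
  return holder.event;
}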

This PR also

  • Makes pool_memory_resource thread safe, so thread_safe_resource_adaptor is no longer required to use this MR (see the sketch after this list).
  • Refactors the device_mr tests to facilitate code reuse and adds multithreaded tests.
  • Adds DEVICE_MR_PTDS_TEST -- multithreaded tests with per-thread default streams enabled.
  • Adds a multithreaded test that allocates and frees memory on different threads.
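
The thread-safety change is conceptually just serializing the allocate/deallocate paths. A minimal sketch of that idea with std::mutex (the class and member names here are assumptions for illustration, not the PR's code):

// Hypothetical sketch of guarding pool state with a mutex so a separate
// thread_safe_resource_adaptor is not needed. Names are illustrative only.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <mutex>

class pool_like_resource {
 public:
  void* do_allocate(std::size_t bytes, cudaStream_t stream)
  {
    std::lock_guard<std::mutex> lock(mtx_);  // serialize access to pool bookkeeping
    return allocate_from_pool(bytes, stream);
  }

  void do_deallocate(void* p, std::size_t bytes, cudaStream_t stream)
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return_to_pool(p, bytes, stream);
  }

 private:
  void* allocate_from_pool(std::size_t bytes, cudaStream_t stream);  // pool logic (not shown)
  void return_to_pool(void* p, std::size_t bytes, cudaStream_t stream);
  std::mutex mtx_;
};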

TODO: benchmark with a playback of a multithreaded/PTDS log.

In benchmarks, I find that when compiled without PTDS support, this version performs nearly the same as before. When compiled with PTDS support but run on a single thread (using only stream 0), run time increases by at most 2.4x, and in most cases by at most 2x.

Compiled without --default-stream-per-thread, here we compare the performance of the original branch-0.15 pool_memory_resource to this PR (numbers are the relative increase in run time -- so mostly within 6%).

(rapids) rapids@compose:~/rmm/build/release$ googlebenchmark/googlebenchmark/tools/compare.py benchmarks rab_0.15.json rab_ptds_off.json 
Comparing rab_0.15.json to rab_ptds_off.json
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations<pool_mr>/1000/1                     +0.0135         +0.0134             0             0             0             0
BM_RandomAllocations<pool_mr>/1000/4                     +0.0329         +0.0328             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/64                    -0.0100         -0.0100             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/256                   +0.0390         +0.0385             0             0             0             0
BM_RandomAllocations<pool_mr>/1000/1024                  +0.0495         +0.0494             0             0             0             0
BM_RandomAllocations<pool_mr>/1000/4096                  -0.0083         -0.0089             0             0             0             0
BM_RandomAllocations<pool_mr>/10000/1                    -0.0137         -0.0136            71            70            71            70
BM_RandomAllocations<pool_mr>/10000/4                    +0.0056         +0.0056            74            74            74            74
BM_RandomAllocations<pool_mr>/10000/64                   +0.0297         +0.0299            11            11            11            11
BM_RandomAllocations<pool_mr>/10000/256                  +0.0322         +0.0322             4             5             4             5
BM_RandomAllocations<pool_mr>/10000/1024                 +0.0599         +0.0600             3             3             3             3
BM_RandomAllocations<pool_mr>/10000/4096                 -0.0127         +0.0024             5             5             5             5
BM_RandomAllocations<pool_mr>/100000/1                   +0.0180         +0.0179         10419         10607         10417         10604
BM_RandomAllocations<pool_mr>/100000/4                   +0.0618         +0.0617          3240          3440          3239          3439
BM_RandomAllocations<pool_mr>/100000/64                  +0.0103         +0.0103           124           126           124           126
BM_RandomAllocations<pool_mr>/100000/256                 +0.0387         +0.0386            45            46            45            46
BM_RandomAllocations<pool_mr>/100000/1024                +0.0609         +0.0607            29            31            29            31
BM_RandomAllocations<pool_mr>/100000/4096                -0.0099         -0.0097            31            31            28            28

Compiled with --default-stream-per-thread:

(rapids) rapids@compose:~/rmm/build/release$ googlebenchmark/googlebenchmark/tools/compare.py benchmarks rab_0.15.json rab_pool_ptds_on.json 
Comparing rab_0.15.json to rab_pool_ptds_on.json
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations<pool_mr>/1000/1                     +0.5836         +0.5835             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/4                     +0.6111         +0.6111             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/64                    +0.5470         +0.5468             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/256                   +0.7292         +0.7291             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/1024                  +0.9819         +0.9820             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/4096                  +1.2496         +1.1654             0             1             0             0
BM_RandomAllocations<pool_mr>/10000/1                    +0.0331         +0.0332            71            74            71            74
BM_RandomAllocations<pool_mr>/10000/4                    +0.0510         +0.0510            74            78            74            78
BM_RandomAllocations<pool_mr>/10000/64                   +0.3629         +0.3632            11            15            11            15
BM_RandomAllocations<pool_mr>/10000/256                  +0.6303         +0.6306             4             7             4             7
BM_RandomAllocations<pool_mr>/10000/1024                 +0.9057         +0.9058             3             6             3             6
BM_RandomAllocations<pool_mr>/10000/4096                 +0.5270         +0.5523             5             8             5             7
BM_RandomAllocations<pool_mr>/100000/1                   +0.0091         +0.0091         10419         10514         10417         10512
BM_RandomAllocations<pool_mr>/100000/4                   +0.0345         +0.0345          3240          3351          3239          3351
BM_RandomAllocations<pool_mr>/100000/64                  +0.3326         +0.3327           124           166           124           166
BM_RandomAllocations<pool_mr>/100000/256                 +0.6305         +0.6306            45            73            45            73
BM_RandomAllocations<pool_mr>/100000/1024                +0.9534         +0.9534            29            58            29            58
BM_RandomAllocations<pool_mr>/100000/4096                +0.9593         +0.9443            31            61            28            55

In comparison to cnmem compiled with PTDS on, pool_mr can be up to 2.3x slower in some cases, but nearly 2x faster in others. I also expect it to do better in real applications, because cnmem synchronizes the whole device on every allocation when PTDS is enabled, whereas pool_mr only inserts cudaStreamWaitEvent calls.

(rapids) rapids@compose:~/rmm/build/release$ googlebenchmark/googlebenchmark/tools/compare.py benchmarks rab_cnmem_ptds_on.json rab_pool_ptds_on.json 
Comparing rab_cnmem_ptds_on.json to rab_pool_ptds_on.json
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations<pool_mr>/1000/1                     +0.1900         +0.1900             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/4                     +0.1952         +0.1951             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/64                    +0.0560         +0.0567             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/256                   +0.5657         +0.5661             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/1024                  +0.8971         +0.8978             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/4096                  +1.2967         +1.0264             0             1             0             0
BM_RandomAllocations<pool_mr>/10000/1                    -0.2384         -0.2384            97            74            97            74
BM_RandomAllocations<pool_mr>/10000/4                    -0.2434         -0.2434           103            78           103            78
BM_RandomAllocations<pool_mr>/10000/64                   -0.2676         -0.2676            20            15            20            15
BM_RandomAllocations<pool_mr>/10000/256                  +0.5197         +0.5197             5             7             5             7
BM_RandomAllocations<pool_mr>/10000/1024                 +0.8256         +0.8256             3             6             3             6
BM_RandomAllocations<pool_mr>/10000/4096                 +1.3167         +1.0803             3             8             3             7
BM_RandomAllocations<pool_mr>/100000/1                   -0.4705         -0.4705         19858         10514         19852         10512
BM_RandomAllocations<pool_mr>/100000/4                   -0.4059         -0.4058          5641          3351          5639          3351
BM_RandomAllocations<pool_mr>/100000/64                  -0.2270         -0.2270           214           166           214           166
BM_RandomAllocations<pool_mr>/100000/256                 +0.4937         +0.4937            49            73            49            73
BM_RandomAllocations<pool_mr>/100000/1024                +0.8427         +0.8427            31            58            31            58
BM_RandomAllocations<pool_mr>/100000/4096                +1.1488         +0.9514            28            61            28            55
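
To make the contrast with cnmem concrete, here is a minimal sketch (not the PR's code; function names are made up) of the event-based pattern: record an event on the stream that freed a block, then make the stream that later reuses the block wait on that event instead of synchronizing the whole device.

// Hypothetical sketch of stream-ordered reuse of a freed block via an event.
#include <cuda_runtime_api.h>

// When a block is deallocated on `freeing_stream`, record an event so we know
// when all previously enqueued work that may still touch the block has finished.
inline void mark_freed(cudaEvent_t event, cudaStream_t freeing_stream)
{
  cudaEventRecord(event, freeing_stream);
}

// When the block is handed out again on `allocating_stream`, make that stream
// wait on the recorded event rather than calling cudaDeviceSynchronize().
inline void make_safe_to_reuse(cudaEvent_t event, cudaStream_t allocating_stream)
{
  cudaStreamWaitEvent(allocating_stream, event, 0);
}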

@harrism harrism added the 2 - In Progress Currently a work in progress label Jun 24, 2020
@harrism harrism requested a review from jrhemstad June 24, 2020 03:51
@harrism harrism self-assigned this Jun 24, 2020
@GPUtester
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

cudaEvent_t get_event(cudaStream_t stream)
{
#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
if (stream == 0) {
Contributor

I think we need to do something other than check against 0 for the default stream. Someone could pass cudaStreamPerThread explicitly, which is different from 0.

Member Author

Yeah, thought about that. Need to check for either 0 or cudaStreamPerThread, good point.

Member Author

Fixed.
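
For reference, the fix amounts to treating both the legacy 0 handle and cudaStreamPerThread as the per-thread default stream. A minimal sketch (the is_default_stream helper is an illustration, not necessarily the PR's code):

// Hypothetical helper: under PTDS, both stream 0 and the explicit
// cudaStreamPerThread handle refer to the calling thread's default stream.
#include <cuda_runtime_api.h>

inline bool is_default_stream(cudaStream_t stream)
{
#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
  return stream == 0 || stream == cudaStreamPerThread;
#else
  return stream == 0;
#endif
}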

Contributor

@jrhemstad jrhemstad left a comment

Some optimization ideas that are orthogonal to PTDS, but which should improve performance for both PTDS and non-PTDS modes.

Comment on lines 141 to 142
stream_free_blocks_[stream_event].insert(blocks.begin(), blocks.end());
stream_free_blocks_.erase(blocks_event);
Contributor

It feels like there is a potential optimization here. We're merging two sorted std::lists and coalescing adjacent blocks. If we ignored the fact that we're coalescing blocks, we could just use std::list::merge.

The current implementation is O(m*n), as it requires a linear search of the destination list for every element in the source list. It seems like we should be able to make that O(m + n).

At the very least, I think we should add a free_list::merge(free_list&& other) function and use that here. Then we can explore doing something more optimal that exploits the fact that both lists are already sorted by pointer.

Contributor

One option would be to just use std::list::merge and ignore coalescing, and then do a second pass through the merged list to coalesce any adjacent blocks. But that operation can likely be fused into a custom merge algorithm.
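
A rough illustration of that two-pass idea (merge, then coalesce in a single linear pass). The block type and is_contiguous_before helper here are assumptions for the sketch, not RMM's actual free_list interface:

// Hypothetical sketch: merge two pointer-sorted free lists with std::list::merge,
// then fuse adjacent blocks in one linear pass.
#include <cstddef>
#include <iterator>
#include <list>

struct block {
  char* ptr{};
  std::size_t size{};
  bool is_contiguous_before(block const& b) const { return ptr + size == b.ptr; }
};

inline void merge_and_coalesce(std::list<block>& dst, std::list<block>&& src)
{
  // O(m + n) merge of two lists that are already sorted by pointer.
  dst.merge(std::move(src),
            [](block const& a, block const& b) { return a.ptr < b.ptr; });

  // Second pass: fuse each block into its predecessor when they are adjacent.
  auto it = dst.begin();
  if (it == dst.end()) return;
  for (auto next = std::next(it); next != dst.end();) {
    if (it->is_contiguous_before(*next)) {
      it->size += next->size;
      next = dst.erase(next);  // erase returns the iterator after the removed element
    } else {
      it = next;
      ++next;
    }
  }
}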

@harrism harrism marked this pull request as ready for review July 1, 2020 05:44
@harrism harrism requested a review from a team as a code owner July 1, 2020 05:44
@harrism harrism changed the title Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events [REVIEW] Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events Jul 1, 2020
@harrism harrism added 3 - Ready for review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 1, 2020
Contributor

@trevorsm7 trevorsm7 left a comment

Not finished reviewing; flushing some comments I had so far.


TYPED_TEST(MRTest_mt, AllocFreeDifferentThreadsSameStream)
{
test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream);
Member

I think this needs to be:

Suggested change
test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream);
test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream, this->stream);

otherwise streamB will default to 0, and this does not test different threads on the same stream as the test name implies. I think the default arguments for test_allocate_free_different_threads are causing more harm than good.

Member Author

Good catch!

@rongou
Contributor

rongou commented Jul 3, 2020

I patched in this PR and ran the TPCH job a bunch of times. I didn't see the occasional memory errors I was getting from CNMeM. Looks solid!

One thing that threw me for a loop was that after I changed the JNI code to use pool_memory_resource, the build complained about not finding librmm. Turns out once we get rid of CNMeM we no longer need to load librmm.so.

@kkraus14
Contributor

kkraus14 commented Jul 3, 2020

Turns out once we get rid of CNMeM we no longer need to load librmm.so.

Yup, once CNMeM is gone rmm will move to being a header only C++ library 🎉

Contributor

@trevorsm7 trevorsm7 left a comment

I noticed a copyright that can be updated. Looks good otherwise.

Contributor

@jrhemstad jrhemstad left a comment

The new design inextricably links a single CUDA event to every stream (default or otherwise). However, there are a few places where those two pieces of state (the stream and its event) are allowed to be independent variables. I think it would make the code clearer and safer to keep those things tightly coupled when it's expected that they relate to each other by adding a simple wrapper class for a stream and an event:

struct stream_event {
  cudaStream_t stream;
  cudaEvent_t event;
};

@harrism harrism merged commit 051435f into rapidsai:branch-0.15 Jul 10, 2020