[REVIEW] Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events #425

Merged
merged 33 commits into rapidsai:branch-0.15
Jul 10, 2020

Conversation

harrism
Member

@harrism harrism commented Jun 24, 2020

This closes #410, but in a different (better) way than #416. This PR uses thread-local storage to create a unique CUDA event for each thread's default stream in the PTDS case. It also attempts to unify the code paths for PTDS and non-PTDS as much as possible; however, the overhead of calling cudaEventRecord on every deallocation proves expensive, so it is currently disabled in the non-PTDS case. (#416 instead creates an event for every freed block and maintains lists of events that must be synchronized when memory is allocated from a coalesced block. These lists proved to have high overhead in both performance and code complexity.)
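
For illustration, a minimal sketch of the thread-local event idea (not the exact code in this PR; the default_stream_event name is made up here):

// Hypothetical sketch: one CUDA event per thread, created lazily on first use
// and destroyed automatically when the thread exits.
#include <cuda_runtime_api.h>

cudaEvent_t default_stream_event()
{
  // Small RAII holder so the event is cleaned up at thread exit.
  struct event_holder {
    cudaEvent_t event{};
    event_holder() { cudaEventCreateWithFlags(&event, cudaEventDisableTiming); }
    ~event_holder() { cudaEventDestroy(event); }
  };
  // thread_local: each thread gets its own event for its per-thread default stream.
  thread_local event_holder holder;
  return holder.event;
}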

This PR also

  • Makes pool_memory_resource thread safe, so thread_safe_resource_adaptor is no longer required to use this MR (see the sketch after this list).
  • Refactors the device_mr tests to facilitate code reuse and adds multithreaded tests.
  • Adds DEVICE_MR_PTDS_TEST -- multithreaded tests with per-thread default streams enabled.
  • Adds a multithreaded test that allocates and frees memory on different threads.
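
The thread-safety change is conceptually just serializing the allocate/deallocate paths. A minimal sketch of that idea with std::mutex (the class and member names here are assumptions for illustration, not the PR's code):

// Hypothetical sketch of guarding pool state with a mutex so a separate
// thread_safe_resource_adaptor is not needed. Names are illustrative only.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <mutex>

class pool_like_resource {
 public:
  void* do_allocate(std::size_t bytes, cudaStream_t stream)
  {
    std::lock_guard<std::mutex> lock(mtx_);  // serialize access to pool bookkeeping
    return allocate_from_pool(bytes, stream);
  }

  void do_deallocate(void* p, std::size_t bytes, cudaStream_t stream)
  {
    std::lock_guard<std::mutex> lock(mtx_);
    return_to_pool(p, bytes, stream);
  }

 private:
  void* allocate_from_pool(std::size_t bytes, cudaStream_t stream);  // pool logic (not shown)
  void return_to_pool(void* p, std::size_t bytes, cudaStream_t stream);
  std::mutex mtx_;
};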

TODO: benchmark with a playback of a multithreaded/PTDS log.

In benchmarks, I find that when compiled without PTDS support, this version performs nearly the same as before. When compiled with PTDS support but run on a single thread (using only stream 0), run time increases by at most 2.4x, and in most cases by at most 2x.

Compiled without --default-stream-per-thread, here we compare the performance of the original branch-0.15 pool_memory_resource to this PR (numbers are the relative increase in run time -- so mostly within 6%).

(rapids) rapids@compose:~/rmm/build/release$ googlebenchmark/googlebenchmark/tools/compare.py benchmarks rab_0.15.json rab_ptds_off.json 
Comparing rab_0.15.json to rab_ptds_off.json
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations<pool_mr>/1000/1                     +0.0135         +0.0134             0             0             0             0
BM_RandomAllocations<pool_mr>/1000/4                     +0.0329         +0.0328             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/64                    -0.0100         -0.0100             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/256                   +0.0390         +0.0385             0             0             0             0
BM_RandomAllocations<pool_mr>/1000/1024                  +0.0495         +0.0494             0             0             0             0
BM_RandomAllocations<pool_mr>/1000/4096                  -0.0083         -0.0089             0             0             0             0
BM_RandomAllocations<pool_mr>/10000/1                    -0.0137         -0.0136            71            70            71            70
BM_RandomAllocations<pool_mr>/10000/4                    +0.0056         +0.0056            74            74            74            74
BM_RandomAllocations<pool_mr>/10000/64                   +0.0297         +0.0299            11            11            11            11
BM_RandomAllocations<pool_mr>/10000/256                  +0.0322         +0.0322             4             5             4             5
BM_RandomAllocations<pool_mr>/10000/1024                 +0.0599         +0.0600             3             3             3             3
BM_RandomAllocations<pool_mr>/10000/4096                 -0.0127         +0.0024             5             5             5             5
BM_RandomAllocations<pool_mr>/100000/1                   +0.0180         +0.0179         10419         10607         10417         10604
BM_RandomAllocations<pool_mr>/100000/4                   +0.0618         +0.0617          3240          3440          3239          3439
BM_RandomAllocations<pool_mr>/100000/64                  +0.0103         +0.0103           124           126           124           126
BM_RandomAllocations<pool_mr>/100000/256                 +0.0387         +0.0386            45            46            45            46
BM_RandomAllocations<pool_mr>/100000/1024                +0.0609         +0.0607            29            31            29            31
BM_RandomAllocations<pool_mr>/100000/4096                -0.0099         -0.0097            31            31            28            28

Compiled with --default-stream-per-thread:

(rapids) rapids@compose:~/rmm/build/release$ googlebenchmark/googlebenchmark/tools/compare.py benchmarks rab_0.15.json rab_pool_ptds_on.json 
Comparing rab_0.15.json to rab_pool_ptds_on.json
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations<pool_mr>/1000/1                     +0.5836         +0.5835             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/4                     +0.6111         +0.6111             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/64                    +0.5470         +0.5468             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/256                   +0.7292         +0.7291             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/1024                  +0.9819         +0.9820             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/4096                  +1.2496         +1.1654             0             1             0             0
BM_RandomAllocations<pool_mr>/10000/1                    +0.0331         +0.0332            71            74            71            74
BM_RandomAllocations<pool_mr>/10000/4                    +0.0510         +0.0510            74            78            74            78
BM_RandomAllocations<pool_mr>/10000/64                   +0.3629         +0.3632            11            15            11            15
BM_RandomAllocations<pool_mr>/10000/256                  +0.6303         +0.6306             4             7             4             7
BM_RandomAllocations<pool_mr>/10000/1024                 +0.9057         +0.9058             3             6             3             6
BM_RandomAllocations<pool_mr>/10000/4096                 +0.5270         +0.5523             5             8             5             7
BM_RandomAllocations<pool_mr>/100000/1                   +0.0091         +0.0091         10419         10514         10417         10512
BM_RandomAllocations<pool_mr>/100000/4                   +0.0345         +0.0345          3240          3351          3239          3351
BM_RandomAllocations<pool_mr>/100000/64                  +0.3326         +0.3327           124           166           124           166
BM_RandomAllocations<pool_mr>/100000/256                 +0.6305         +0.6306            45            73            45            73
BM_RandomAllocations<pool_mr>/100000/1024                +0.9534         +0.9534            29            58            29            58
BM_RandomAllocations<pool_mr>/100000/4096                +0.9593         +0.9443            31            61            28            55

In comparison to cnmem compiled with PTDS on, pool_mr can be up to 2.3x slower in some cases, but nearly 2x faster in others. I also expect it to do better in real applications, because cnmem synchronizes the whole device on every allocation when PTDS is enabled, whereas pool_mr only inserts cudaStreamWaitEvent calls.

(rapids) rapids@compose:~/rmm/build/release$ googlebenchmark/googlebenchmark/tools/compare.py benchmarks rab_cnmem_ptds_on.json rab_pool_ptds_on.json 
Comparing rab_cnmem_ptds_on.json to rab_pool_ptds_on.json
Benchmark                                                   Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations<pool_mr>/1000/1                     +0.1900         +0.1900             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/4                     +0.1952         +0.1951             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/64                    +0.0560         +0.0567             1             1             1             1
BM_RandomAllocations<pool_mr>/1000/256                   +0.5657         +0.5661             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/1024                  +0.8971         +0.8978             0             1             0             1
BM_RandomAllocations<pool_mr>/1000/4096                  +1.2967         +1.0264             0             1             0             0
BM_RandomAllocations<pool_mr>/10000/1                    -0.2384         -0.2384            97            74            97            74
BM_RandomAllocations<pool_mr>/10000/4                    -0.2434         -0.2434           103            78           103            78
BM_RandomAllocations<pool_mr>/10000/64                   -0.2676         -0.2676            20            15            20            15
BM_RandomAllocations<pool_mr>/10000/256                  +0.5197         +0.5197             5             7             5             7
BM_RandomAllocations<pool_mr>/10000/1024                 +0.8256         +0.8256             3             6             3             6
BM_RandomAllocations<pool_mr>/10000/4096                 +1.3167         +1.0803             3             8             3             7
BM_RandomAllocations<pool_mr>/100000/1                   -0.4705         -0.4705         19858         10514         19852         10512
BM_RandomAllocations<pool_mr>/100000/4                   -0.4059         -0.4058          5641          3351          5639          3351
BM_RandomAllocations<pool_mr>/100000/64                  -0.2270         -0.2270           214           166           214           166
BM_RandomAllocations<pool_mr>/100000/256                 +0.4937         +0.4937            49            73            49            73
BM_RandomAllocations<pool_mr>/100000/1024                +0.8427         +0.8427            31            58            31            58
BM_RandomAllocations<pool_mr>/100000/4096                +1.1488         +0.9514            28            61            28            55
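
To make the contrast with cnmem concrete, here is a minimal sketch (not the PR's code; function names are made up) of the event-based pattern: record an event on the stream that freed a block, then make the stream that later reuses the block wait on that event instead of synchronizing the whole device.

// Hypothetical sketch of stream-ordered reuse of a freed block via an event.
#include <cuda_runtime_api.h>

// When a block is deallocated on `freeing_stream`, record an event so we know
// when all previously enqueued work that may still touch the block has finished.
inline void mark_freed(cudaEvent_t event, cudaStream_t freeing_stream)
{
  cudaEventRecord(event, freeing_stream);
}

// When the block is handed out again on `allocating_stream`, make that stream
// wait on the recorded event rather than calling cudaDeviceSynchronize().
inline void make_safe_to_reuse(cudaEvent_t event, cudaStream_t allocating_stream)
{
  cudaStreamWaitEvent(allocating_stream, event, 0);
}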

@harrism harrism added the 2 - In Progress Currently a work in progress label Jun 24, 2020
@harrism harrism requested a review from jrhemstad June 24, 2020 03:51
@harrism harrism self-assigned this Jun 24, 2020
@GPUtester
Contributor

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

cudaEvent_t get_event(cudaStream_t stream)
{
#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
if (stream == 0) {
Contributor

I think we need to do something other than check against 0 for the default stream. Someone could pass cudaStreamPerThread explicitly, which is different from 0.

Member Author

Yeah, thought about that. Need to check for either 0 or cudaStreamPerThread, good point.

Member Author

Fixed.
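
For reference, the fix amounts to treating both the legacy 0 handle and cudaStreamPerThread as the per-thread default stream. A minimal sketch (the is_default_stream helper is an illustration, not necessarily the PR's code):

// Hypothetical helper: under PTDS, both stream 0 and the explicit
// cudaStreamPerThread handle refer to the calling thread's default stream.
#include <cuda_runtime_api.h>

inline bool is_default_stream(cudaStream_t stream)
{
#ifdef CUDA_API_PER_THREAD_DEFAULT_STREAM
  return stream == 0 || stream == cudaStreamPerThread;
#else
  return stream == 0;
#endif
}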

Contributor

@jrhemstad jrhemstad left a comment

Some optimization ideas that are orthogonal to PTDS, but which should improve performance for both PTDS and non-PTDS modes.

Comment on lines 141 to 142
stream_free_blocks_[stream_event].insert(blocks.begin(), blocks.end());
stream_free_blocks_.erase(blocks_event);
Contributor

It feels like there is a potential optimization here. We're merging two sorted std::lists and coalescing adjacent blocks. If we ignored the fact that we're coalescing blocks, we could just use std::list::merge.

The current implementation is O(m*n), as it requires a linear search of the destination list for every element in the source list. It seems like we should be able to make that O(m + n).

At the very least, I think we should add a free_list::merge(free_list&& other) function and use that here. Then we can explore doing something more optimal that exploits the fact that both lists are already sorted by pointer.

Contributor

One option would be to just use std::list::merge and ignore coalescing, and then do a second pass through the merged list to coalesce any adjacent blocks. But that operation can likely be fused into a custom merge algorithm.
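
A rough illustration of that two-pass idea (merge, then coalesce in a single linear pass). The block type and is_contiguous_before helper here are assumptions for the sketch, not RMM's actual free_list interface:

// Hypothetical sketch: merge two pointer-sorted free lists with std::list::merge,
// then fuse adjacent blocks in one linear pass.
#include <cstddef>
#include <iterator>
#include <list>

struct block {
  char* ptr{};
  std::size_t size{};
  bool is_contiguous_before(block const& b) const { return ptr + size == b.ptr; }
};

inline void merge_and_coalesce(std::list<block>& dst, std::list<block>&& src)
{
  // O(m + n) merge of two lists that are already sorted by pointer.
  dst.merge(std::move(src),
            [](block const& a, block const& b) { return a.ptr < b.ptr; });

  // Second pass: fuse each block into its predecessor when they are adjacent.
  auto it = dst.begin();
  if (it == dst.end()) return;
  for (auto next = std::next(it); next != dst.end();) {
    if (it->is_contiguous_before(*next)) {
      it->size += next->size;
      next = dst.erase(next);  // erase returns the iterator after the removed element
    } else {
      it = next;
      ++next;
    }
  }
}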

@harrism harrism marked this pull request as ready for review July 1, 2020 05:44
@harrism harrism requested a review from a team as a code owner July 1, 2020 05:44
@harrism harrism changed the title Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events [REVIEW] Add per-thread-default stream support to pool_memory_resource using thread-local CUDA events Jul 1, 2020
@harrism harrism added 3 - Ready for review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 1, 2020
Contributor

@trevorsm7 trevorsm7 left a comment

Not finished reviewing; flushing some comments I had so far.


TYPED_TEST(MRTest_mt, AllocFreeDifferentThreadsSameStream)
{
test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream);
Member

I think this needs to be:

Suggested change
test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream);
test_allocate_free_different_threads<TypeParam>(this->mr.get(), this->stream, this->stream);

otherwise streamB will default to 0, and this does not test different threads on the same stream as the test name implies. I think the default arguments for test_allocate_free_different_threads are causing more harm than good.

Member Author

Good catch!

@rongou
Contributor

rongou commented Jul 3, 2020

I patched in this PR and ran the TPCH job a bunch of times. I didn't see the occasional memory errors I was getting from CNMeM. Looks solid!

One thing that threw me for a loop was that after I changed the JNI code to use pool_memory_resource, the build complained about not finding librmm. Turns out once we get rid of CNMeM we no longer need to load librmm.so.

@kkraus14
Contributor

kkraus14 commented Jul 3, 2020

Turns out once we get rid of CNMeM we no longer need to load librmm.so.

Yup, once CNMeM is gone rmm will move to being a header only C++ library 🎉

Contributor

@trevorsm7 trevorsm7 left a comment

I noticed a copyright that can be updated. Looks good otherwise.

Contributor

@jrhemstad jrhemstad left a comment

The new design inextricably links a single CUDA event to every stream (default or otherwise). However, there are a few places where those two pieces of state (the stream and its event) are allowed to be independent variables. I think it would make the code clearer and safer to keep those things tightly coupled when it's expected that they relate to each other by adding a simple wrapper class for a stream and an event:

struct stream_event {
  cudaStream_t stream;
  cudaEvent_t event;
};

@harrism harrism merged commit 051435f into rapidsai:branch-0.15 Jul 10, 2020