
Add option to track RMM allocations #842

Merged: 14 commits merged into rapidsai:branch-22.04 on Feb 25, 2022

Conversation

@shwina (Contributor) commented Jan 31, 2022

Adds an rmm_track_allocations option, which enables workers to query the amount of memory allocated by RMM at any time via mr.get_allocated_bytes().

This is used in dask/distributed#5740.
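For context, a minimal sketch of how such a query could look from the client side (illustrative only, not code from this PR; the scheduler address is hypothetical, and it assumes the workers were started with rmm_track_allocations=True):

import rmm
from distributed import Client


def allocated_bytes():
    # With rmm_track_allocations=True, the worker's current device resource
    # is a TrackingResourceAdaptor, which reports live allocated bytes.
    return rmm.mr.get_current_device_resource().get_allocated_bytes()


client = Client("127.0.0.1:8786")  # hypothetical scheduler address
print(client.run(allocated_bytes))  # {worker_address: bytes_allocated, ...}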

@shwina requested a review from a team as a code owner January 31, 2022 20:53
@github-actions bot added the python (python code needed) label Jan 31, 2022
@shwina changed the title from "Rmm track allocations" to "Add option to track RMM allocations" Jan 31, 2022
@pentschev (Member) left a comment

That looks really cool, thanks @shwina. Could we add a test as well? We have some in

@gen_test(timeout=20)
async def test_rmm_pool():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_pool_size="2GB", asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.PoolMemoryResource


@gen_test(timeout=20)
async def test_rmm_maximum_poolsize_without_poolsize_error():
    pytest.importorskip("rmm")

    with pytest.raises(ValueError):
        await LocalCUDACluster(rmm_maximum_pool_size="2GB", asynchronous=True)


@gen_test(timeout=20)
async def test_rmm_managed():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_managed_memory=True, asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.ManagedMemoryResource


@pytest.mark.skipif(
    _driver_version < 11020 or _runtime_version < 11020,
    reason="cudaMallocAsync not supported",
)
@gen_test(timeout=20)
async def test_rmm_async():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_async=True, asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.CudaAsyncMemoryResource


@gen_test(timeout=20)
async def test_rmm_logging():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(
        rmm_pool_size="2GB", rmm_log_directory=".", asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.LoggingResourceAdaptor
that can be used as reference.

@pentschev added the 3 - Ready for Review, improvement, and non-breaking labels Jan 31, 2022
@pentschev (Member) left a comment

@shwina it seems there are some style issues; I suggest using pre-commit to resolve them automatically at commit time.

@codecov-commenter commented Feb 17, 2022

Codecov Report

Merging #842 (4da1fc9) into branch-22.04 (06decc6) will decrease coverage by 24.55%.
The diff coverage is 100.00%.

Impacted file tree graph

@@                Coverage Diff                @@
##           branch-22.04     #842       +/-   ##
=================================================
- Coverage         89.55%   65.00%   -24.56%     
=================================================
  Files                16       22        +6     
  Lines              2078     3066      +988     
=================================================
+ Hits               1861     1993      +132     
- Misses              217     1073      +856     
Impacted Files                                    Coverage Δ
dask_cuda/cuda_worker.py                          76.54% <ø> (ø)
dask_cuda/cli/dask_cuda_worker.py                 97.26% <100.00%> (+0.03%) ⬆️
dask_cuda/local_cuda_cluster.py                   88.17% <100.00%> (+0.12%) ⬆️
dask_cuda/utils.py                                89.75% <100.00%> (+0.21%) ⬆️
dask_cuda/benchmarks/local_cupy.py                 0.00% <0.00%> (ø)
dask_cuda/_version.py                             44.80% <0.00%> (ø)
dask_cuda/benchmarks/utils.py                      0.00% <0.00%> (ø)
dask_cuda/benchmarks/local_cupy_map_overlap.py     0.00% <0.00%> (ø)
dask_cuda/benchmarks/local_cudf_merge.py           0.00% <0.00%> (ø)
dask_cuda/benchmarks/local_cudf_shuffle.py         0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 06decc6...4da1fc9.

@shwina (Contributor, Author) commented Feb 17, 2022

@pentschev I added a test and resolved the style issues - could you please take a look?

@pentschev (Member) left a comment

@shwina thanks for addressing those. I have a couple of minor requests; they shouldn't be too much work, but otherwise this looks great!

@@ -210,6 +210,7 @@ def __init__(
     rmm_managed_memory=False,
     rmm_async=False,
     rmm_log_directory=None,
+    rmm_track_allocations=False,
@pentschev (Member) commented on the diff:

Could you also add docstrings for the new parameter in here?
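For illustration, a numpydoc-style entry of the kind being requested might read as follows (a sketch only, not the text that was merged):

rmm_track_allocations : bool, default False
    If True, wraps the memory resource used by each worker with
    rmm.mr.TrackingResourceAdaptor, which tracks the amount of memory
    the worker has allocated through RMM and can be queried at runtime;
    the tracking may add some overhead.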

@shwina (Contributor, Author):

Done!

Comment on lines +258 to +286
def test_rmm_track_allocations(loop):  # noqa: F811
    rmm = pytest.importorskip("rmm")
    with popen(["dask-scheduler", "--port", "9369", "--no-dashboard"]):
        with popen(
            [
                "dask-cuda-worker",
                "127.0.0.1:9369",
                "--host",
                "127.0.0.1",
                "--rmm-pool-size",
                "2 GB",
                "--no-dashboard",
                "--rmm-track-allocations",
            ]
        ):
            with Client("127.0.0.1:9369", loop=loop) as client:
                assert wait_workers(client, n_gpus=get_n_gpus())

                memory_resource_type = client.run(
                    rmm.mr.get_current_device_resource_type
                )
                for v in memory_resource_type.values():
                    assert v is rmm.mr.TrackingResourceAdaptor

                memory_resource_upstream_type = client.run(
                    lambda: type(rmm.mr.get_current_device_resource().upstream_mr)
                )
                for v in memory_resource_upstream_type.values():
                    assert v is rmm.mr.PoolMemoryResource
@pentschev (Member) commented on the diff:

Given that dask-cuda-worker and LocalCUDACluster have somewhat different control paths, could you add the same test for LocalCUDACluster in https://github.com/rapidsai/dask-cuda/blob/branch-22.04/dask_cuda/tests/test_local_cuda_cluster.py? It should be fairly straightforward: the inner logic will be the same, just the cluster setup will be slightly different.
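For reference, a sketch of what that LocalCUDACluster variant could look like, modeled on the tests quoted earlier and assuming the same imports and fixtures (the exact test that was merged may differ):

@gen_test(timeout=20)
async def test_rmm_track_allocations():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(
        rmm_pool_size="2GB", rmm_track_allocations=True, asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # The tracking adaptor should wrap the pool resource.
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.TrackingResourceAdaptor

            memory_resource_upstream_type = await client.run(
                lambda: type(rmm.mr.get_current_device_resource().upstream_mr)
            )
            for v in memory_resource_upstream_type.values():
                assert v is rmm.mr.PoolMemoryResource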

@shwina (Contributor, Author):

Thanks! I added that.

@pentschev (Member):

Errors are unrelated, opened #861 to track.

@pentschev (Member) left a comment

LGTM. There's the failing test reported in #861 which blocks this PR; if this isn't critical to go in immediately, let's wait for that to be fixed, otherwise we may need to xfail that test temporarily.

Thanks @shwina!

@pentschev (Member):

Actually there are some style issues too, @shwina; could you fix them? I would strongly suggest using pre-commit to avoid this hassle on every new commit. 😄


def setup(self, worker=None):
    import rmm
@charlesbluca (Member) commented on the diff:

Are we safe to always import RMM here? If this function is always run as part of the RMMSetup plugin, wouldn't that make RMM an implicit dependency for starting a CUDA cluster?

@pentschev (Member):

Great catch @charlesbluca, indeed I failed to realize that. Yes, we should remove it from the "main" context of setup and leave it solely within the local contexts that exist today, if and only if one of the options enabling RMM is set by the user.

@shwina (Contributor, Author):

Thanks both! I've accepted @pentschev's suggestions here.
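To illustrate the pattern agreed on here (a sketch only; the class shape and parameter names are assumptions based on the snippet above, not the merged code), import rmm inside each branch that actually configures it, so RMM stays an optional dependency:

class RMMSetup:
    # Hypothetical plugin sketch; the real RMMSetup takes more options.
    def __init__(self, initial_pool_size=None, track_allocations=False):
        self.initial_pool_size = initial_pool_size
        self.track_allocations = track_allocations

    def setup(self, worker=None):
        # No top-level "import rmm": only import it inside the branches
        # that need it, so plain CUDA clusters never require RMM.
        if self.initial_pool_size is not None:
            import rmm

            rmm.reinitialize(
                pool_allocator=True, initial_pool_size=self.initial_pool_size
            )
        if self.track_allocations:
            import rmm

            mr = rmm.mr.get_current_device_resource()
            rmm.mr.set_current_device_resource(
                rmm.mr.TrackingResourceAdaptor(mr)
            )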

Review threads on dask_cuda/utils.py (resolved).
shwina and others added 2 commits on February 24, 2022, both co-authored by Peter Andreas Entschev <peter@entschev.com>.
@pentschev (Member) left a comment

Forgot to add back the previously removed RMM imports.

Review threads on dask_cuda/utils.py (resolved).
@pentschev (Member):

LGTM now. Thanks @shwina for the work here, and @charlesbluca for the review and fixes!

@pentschev (Member):

@gpucibot merge

@pentschev (Member): rerun tests

(2 similar "rerun tests" comments from @pentschev followed.)

@pentschev (Member):

@gpucibot merge

@rapids-bot merged commit f09c7c5 into rapidsai:branch-22.04 on Feb 25, 2022