Add option to track RMM allocations #842
Conversation
That looks really cool, thanks @shwina. Could we add a test as well? We have some in
dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py
Lines 140 to 202 in b60dec9
```python
@gen_test(timeout=20)
async def test_rmm_pool():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_pool_size="2GB", asynchronous=True,) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.PoolMemoryResource


@gen_test(timeout=20)
async def test_rmm_maximum_poolsize_without_poolsize_error():
    pytest.importorskip("rmm")
    with pytest.raises(ValueError):
        await LocalCUDACluster(rmm_maximum_pool_size="2GB", asynchronous=True)


@gen_test(timeout=20)
async def test_rmm_managed():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_managed_memory=True, asynchronous=True,) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.ManagedMemoryResource


@pytest.mark.skipif(
    _driver_version < 11020 or _runtime_version < 11020,
    reason="cudaMallocAsync not supported",
)
@gen_test(timeout=20)
async def test_rmm_async():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_async=True, asynchronous=True,) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.CudaAsyncMemoryResource


@gen_test(timeout=20)
async def test_rmm_logging():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(
        rmm_pool_size="2GB", rmm_log_directory=".", asynchronous=True,
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.LoggingResourceAdaptor
```
@shwina it seems that there are some style issues, I suggest using `pre-commit` to automatically resolve them at commit time.
Codecov Report
```
@@            Coverage Diff             @@
##        branch-22.04     #842       +/-   ##
=============================================
- Coverage      89.55%   65.00%   -24.56%
=============================================
  Files             16       22        +6
  Lines           2078     3066      +988
=============================================
+ Hits            1861     1993      +132
- Misses           217     1073      +856
```
Continue to review full report at Codecov.
@pentschev I added a test and resolved the style issues - could you please take a look?
@shwina thanks for addressing those. I have a couple of minor requests, shouldn't be too much work, but otherwise looks great!
```diff
@@ -210,6 +210,7 @@ def __init__(
     rmm_managed_memory=False,
     rmm_async=False,
     rmm_log_directory=None,
+    rmm_track_allocations=False,
```
Could you also add a docstring for the new parameter here?
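A docstring entry for the new parameter might look like the sketch below, in the numpydoc style dask-cuda uses elsewhere. The wording and the wrapper function name (`describe_rmm_track_allocations`) are hypothetical, not the text that was merged:

```python
# Hypothetical sketch of a numpydoc entry for the new parameter; the exact
# wording merged into dask-cuda may differ.
def describe_rmm_track_allocations():
    """Illustrates the proposed parameter documentation.

    Parameters
    ----------
    rmm_track_allocations : bool, default False
        If True, wrap the RMM memory resource in a
        ``TrackingResourceAdaptor``, allowing the amount of memory
        currently allocated through RMM to be queried at any time with
        ``rmm.mr.get_current_device_resource().get_allocated_bytes()``.
    """
    return describe_rmm_track_allocations.__doc__
```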
Done!
```python
def test_rmm_track_allocations(loop):  # noqa: F811
    rmm = pytest.importorskip("rmm")
    with popen(["dask-scheduler", "--port", "9369", "--no-dashboard"]):
        with popen(
            [
                "dask-cuda-worker",
                "127.0.0.1:9369",
                "--host",
                "127.0.0.1",
                "--rmm-pool-size",
                "2 GB",
                "--no-dashboard",
                "--rmm-track-allocations",
            ]
        ):
            with Client("127.0.0.1:9369", loop=loop) as client:
                assert wait_workers(client, n_gpus=get_n_gpus())

                memory_resource_type = client.run(
                    rmm.mr.get_current_device_resource_type
                )
                for v in memory_resource_type.values():
                    assert v is rmm.mr.TrackingResourceAdaptor

                memory_resource_upstream_type = client.run(
                    lambda: type(rmm.mr.get_current_device_resource().upstream_mr)
                )
                for v in memory_resource_upstream_type.values():
                    assert v is rmm.mr.PoolMemoryResource
```
Given `dask-cuda-worker` and `LocalCUDACluster` have somewhat different control paths, could you add the same test for `LocalCUDACluster` in https://github.com/rapidsai/dask-cuda/blob/branch-22.04/dask_cuda/tests/test_local_cuda_cluster.py ? Should be fairly straightforward, the inner logic will be the same, just the cluster setup will be slightly different.
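A sketch of what that `LocalCUDACluster` counterpart could look like, mirroring the structure of `test_rmm_pool` above. The `@gen_test(timeout=20)` decorator used by the existing suite is omitted here so the snippet stands alone, and the exact test that was merged may differ:

```python
# Hypothetical LocalCUDACluster version of the dask-cuda-worker test above.
# Running it requires a GPU plus dask-cuda and rmm installed; imports are
# kept inside the function so merely defining it has no dependencies.
async def test_rmm_track_allocations():
    import pytest

    rmm = pytest.importorskip("rmm")
    from distributed import Client

    from dask_cuda import LocalCUDACluster

    async with LocalCUDACluster(
        rmm_pool_size="2GB", rmm_track_allocations=True, asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # The current resource on each worker should be the tracking adaptor...
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.TrackingResourceAdaptor

            # ...wrapping the pool resource requested via rmm_pool_size.
            upstream_type = await client.run(
                lambda: type(rmm.mr.get_current_device_resource().upstream_mr)
            )
            for v in upstream_type.values():
                assert v is rmm.mr.PoolMemoryResource
```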
Thanks! I added that.
Errors are unrelated, opened #861 to track.
Actually there are some style issues too @shwina, could you fix them? I would strongly suggest using `pre-commit`.
dask_cuda/utils.py (outdated)

```python
    def setup(self, worker=None):
        import rmm
```
Are we safe to always import RMM here? If this function is always run as part of the `RMMSetup` plugin, wouldn't that make RMM an implicit dependency for starting a CUDA cluster?
Great catch @charlesbluca, indeed I failed to realize that. Yes, we should remove it from the "main" context of `setup` and leave it solely within the local contexts that exist today, iff one of those options to enable RMM is set by the user.
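The lazy-import pattern being suggested can be sketched as follows. This is a simplified stand-in for dask-cuda's actual `RMMSetup` plugin, covering only a subset of its options; the point is that `import rmm` moves inside the branches that actually need it:

```python
class RMMSetup:
    """Simplified sketch of a worker plugin that imports RMM only on demand."""

    def __init__(self, initial_pool_size=None, managed_memory=False,
                 track_allocations=False):
        self.initial_pool_size = initial_pool_size
        self.managed_memory = managed_memory
        self.track_allocations = track_allocations

    def setup(self, worker=None):
        # RMM is imported only when an RMM option is enabled, so starting a
        # CUDA cluster without any such option does not require RMM installed.
        if self.initial_pool_size is not None or self.managed_memory:
            import rmm

            rmm.reinitialize(
                pool_allocator=self.initial_pool_size is not None,
                managed_memory=self.managed_memory,
                initial_pool_size=self.initial_pool_size,
            )
        if self.track_allocations:
            import rmm

            # Wrap the current resource so allocations can be queried later.
            mr = rmm.mr.get_current_device_resource()
            rmm.mr.set_current_device_resource(rmm.mr.TrackingResourceAdaptor(mr))
```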
Thanks both! I've accepted @pentschev's suggestions here.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Forgot to add back the previously removed RMM imports.
LGTM now, thanks @shwina for the work here, and @charlesbluca for the review and fixes here too!
@gpucibot merge
rerun tests
@gpucibot merge
Adds the `rmm_track_allocations` option that enables workers to query the amount of RMM memory allocated at any time via `mr.get_allocated_bytes()`. This is used in dask/distributed#5740.
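The idea behind the tracking adaptor can be illustrated with a pure-Python analogue. This is a hypothetical stand-in, not RMM's actual C++-backed implementation: the adaptor wraps an upstream resource (exposed as `upstream_mr`, as in the test above) and keeps a running byte count that `get_allocated_bytes()` reports.

```python
class TrackingAdaptorSketch:
    """Toy analogue of rmm.mr.TrackingResourceAdaptor: wraps an upstream
    allocator and tracks the number of bytes currently outstanding."""

    def __init__(self, upstream_mr):
        self.upstream_mr = upstream_mr
        self._allocated = 0

    def allocate(self, nbytes):
        buf = self.upstream_mr.allocate(nbytes)
        self._allocated += nbytes
        return buf

    def deallocate(self, buf, nbytes):
        self.upstream_mr.deallocate(buf, nbytes)
        self._allocated -= nbytes

    def get_allocated_bytes(self):
        return self._allocated


class HostMemorySketch:
    """Stand-in upstream resource that hands out plain bytearrays."""

    def allocate(self, nbytes):
        return bytearray(nbytes)

    def deallocate(self, buf, nbytes):
        del buf


mr = TrackingAdaptorSketch(HostMemorySketch())
buf = mr.allocate(1024)
assert mr.get_allocated_bytes() == 1024
mr.deallocate(buf, 1024)
assert mr.get_allocated_bytes() == 0
```

Because the adaptor only forwards to its upstream resource, the worker's allocation behavior (pool, managed memory, etc.) is unchanged; tracking is purely additive.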