Add option to track RMM allocations #842
Conversation
That looks really cool, thanks @shwina. Could we add a test as well? We have some in
dask-cuda/dask_cuda/tests/test_local_cuda_cluster.py
Lines 140 to 202 in b60dec9
```python
@gen_test(timeout=20)
async def test_rmm_pool():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_pool_size="2GB", asynchronous=True,) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.PoolMemoryResource


@gen_test(timeout=20)
async def test_rmm_maximum_poolsize_without_poolsize_error():
    pytest.importorskip("rmm")
    with pytest.raises(ValueError):
        await LocalCUDACluster(rmm_maximum_pool_size="2GB", asynchronous=True)


@gen_test(timeout=20)
async def test_rmm_managed():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_managed_memory=True, asynchronous=True,) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.ManagedMemoryResource


@pytest.mark.skipif(
    _driver_version < 11020 or _runtime_version < 11020,
    reason="cudaMallocAsync not supported",
)
@gen_test(timeout=20)
async def test_rmm_async():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(rmm_async=True, asynchronous=True,) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.CudaAsyncMemoryResource


@gen_test(timeout=20)
async def test_rmm_logging():
    rmm = pytest.importorskip("rmm")

    async with LocalCUDACluster(
        rmm_pool_size="2GB", rmm_log_directory=".", asynchronous=True,
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.LoggingResourceAdaptor
```
@shwina it seems that there are some style issues, I suggest using `pre-commit` to automatically resolve them at commit time.
Codecov Report
```
@@            Coverage Diff             @@
##        branch-22.04     #842       +/-   ##
=============================================
- Coverage      89.55%   65.00%   -24.56%
=============================================
  Files             16       22        +6
  Lines           2078     3066      +988
=============================================
+ Hits            1861     1993      +132
- Misses           217     1073      +856
```
Continue to review full report at Codecov.
@pentschev I added a test and resolved the style issues - could you please take a look?
@shwina thanks for addressing those. I have a couple of minor requests, shouldn't be too much work, but otherwise looks great!
```diff
@@ -210,6 +210,7 @@ def __init__(
     rmm_managed_memory=False,
     rmm_async=False,
     rmm_log_directory=None,
+    rmm_track_allocations=False,
```
Could you also add a docstring for the new parameter here?
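A docstring entry for the new parameter might look like the sketch below, in the numpydoc style dask-cuda uses elsewhere. The wording and the wrapper function name (`describe_rmm_track_allocations`) are hypothetical, not the text that was merged:

```python
# Hypothetical sketch of a numpydoc entry for the new parameter; the exact
# wording merged into dask-cuda may differ.
def describe_rmm_track_allocations():
    """Illustrates the proposed parameter documentation.

    Parameters
    ----------
    rmm_track_allocations : bool, default False
        If True, wrap the RMM memory resource in a
        ``TrackingResourceAdaptor``, allowing the amount of memory
        currently allocated through RMM to be queried at any time with
        ``rmm.mr.get_current_device_resource().get_allocated_bytes()``.
    """
    return describe_rmm_track_allocations.__doc__
```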
Done!
```python
def test_rmm_track_allocations(loop):  # noqa: F811
    rmm = pytest.importorskip("rmm")
    with popen(["dask-scheduler", "--port", "9369", "--no-dashboard"]):
        with popen(
            [
                "dask-cuda-worker",
                "127.0.0.1:9369",
                "--host",
                "127.0.0.1",
                "--rmm-pool-size",
                "2 GB",
                "--no-dashboard",
                "--rmm-track-allocations",
            ]
        ):
            with Client("127.0.0.1:9369", loop=loop) as client:
                assert wait_workers(client, n_gpus=get_n_gpus())

                memory_resource_type = client.run(
                    rmm.mr.get_current_device_resource_type
                )
                for v in memory_resource_type.values():
                    assert v is rmm.mr.TrackingResourceAdaptor

                memory_resource_upstream_type = client.run(
                    lambda: type(rmm.mr.get_current_device_resource().upstream_mr)
                )
                for v in memory_resource_upstream_type.values():
                    assert v is rmm.mr.PoolMemoryResource
```
Given `dask-cuda-worker` and `LocalCUDACluster` have somewhat different control paths, could you add the same test for `LocalCUDACluster` in https://github.com/rapidsai/dask-cuda/blob/branch-22.04/dask_cuda/tests/test_local_cuda_cluster.py ? Should be fairly straightforward, the inner logic will be the same, just the cluster setup will be slightly different.
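A sketch of what that `LocalCUDACluster` counterpart could look like, mirroring the structure of `test_rmm_pool` above. The `@gen_test(timeout=20)` decorator used by the existing suite is omitted here so the snippet stands alone, and the exact test that was merged may differ:

```python
# Hypothetical LocalCUDACluster version of the dask-cuda-worker test above.
# Running it requires a GPU plus dask-cuda and rmm installed; imports are
# kept inside the function so merely defining it has no dependencies.
async def test_rmm_track_allocations():
    import pytest

    rmm = pytest.importorskip("rmm")
    from distributed import Client

    from dask_cuda import LocalCUDACluster

    async with LocalCUDACluster(
        rmm_pool_size="2GB", rmm_track_allocations=True, asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # The current resource on each worker should be the tracking adaptor...
            memory_resource_type = await client.run(
                rmm.mr.get_current_device_resource_type
            )
            for v in memory_resource_type.values():
                assert v is rmm.mr.TrackingResourceAdaptor

            # ...wrapping the pool resource requested via rmm_pool_size.
            upstream_type = await client.run(
                lambda: type(rmm.mr.get_current_device_resource().upstream_mr)
            )
            for v in upstream_type.values():
                assert v is rmm.mr.PoolMemoryResource
```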
Thanks! I added that.
Errors are unrelated, opened #861 to track.
Actually there are some style issues too @shwina, could you fix them? I would strongly suggest using `pre-commit`.
dask_cuda/utils.py (outdated)

```python
    def setup(self, worker=None):
        import rmm
```
Are we safe to always import RMM here? If this function is always run as part of the `RMMSetup` plugin, wouldn't that make RMM an implicit dependency for starting a CUDA cluster?
Great catch @charlesbluca, indeed I failed to realize that. Yes, we should remove it from the "main" context of `setup` and leave it solely within the local contexts that exist today, iff one of those options to enable RMM is set by the user.
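The lazy-import pattern being suggested can be sketched as follows. This is a simplified stand-in for dask-cuda's actual `RMMSetup` plugin, covering only a subset of its options; the point is that `import rmm` moves inside the branches that actually need it:

```python
class RMMSetup:
    """Simplified sketch of a worker plugin that imports RMM only on demand."""

    def __init__(self, initial_pool_size=None, managed_memory=False,
                 track_allocations=False):
        self.initial_pool_size = initial_pool_size
        self.managed_memory = managed_memory
        self.track_allocations = track_allocations

    def setup(self, worker=None):
        # RMM is imported only when an RMM option is enabled, so starting a
        # CUDA cluster without any such option does not require RMM installed.
        if self.initial_pool_size is not None or self.managed_memory:
            import rmm

            rmm.reinitialize(
                pool_allocator=self.initial_pool_size is not None,
                managed_memory=self.managed_memory,
                initial_pool_size=self.initial_pool_size,
            )
        if self.track_allocations:
            import rmm

            # Wrap the current resource so allocations can be queried later.
            mr = rmm.mr.get_current_device_resource()
            rmm.mr.set_current_device_resource(rmm.mr.TrackingResourceAdaptor(mr))
```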
Thanks both! I've accepted @pentschev's suggestions here.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Forgot to add back the previously removed RMM imports.
LGTM now, thanks @shwina for the work here, and @charlesbluca for the review and fixes here too!
@gpucibot merge
rerun tests
@gpucibot merge
Adds the `rmm_track_allocations` option that enables workers to query the amount of RMM memory allocated at any time via `mr.get_allocated_bytes()`. This is used in dask/distributed#5740.
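The idea behind the tracking adaptor can be illustrated with a pure-Python analogue. This is a hypothetical stand-in, not RMM's actual C++-backed implementation: the adaptor wraps an upstream resource (exposed as `upstream_mr`, as in the test above) and keeps a running byte count that `get_allocated_bytes()` reports.

```python
class TrackingAdaptorSketch:
    """Toy analogue of rmm.mr.TrackingResourceAdaptor: wraps an upstream
    allocator and tracks the number of bytes currently outstanding."""

    def __init__(self, upstream_mr):
        self.upstream_mr = upstream_mr
        self._allocated = 0

    def allocate(self, nbytes):
        buf = self.upstream_mr.allocate(nbytes)
        self._allocated += nbytes
        return buf

    def deallocate(self, buf, nbytes):
        self.upstream_mr.deallocate(buf, nbytes)
        self._allocated -= nbytes

    def get_allocated_bytes(self):
        return self._allocated


class HostMemorySketch:
    """Stand-in upstream resource that hands out plain bytearrays."""

    def allocate(self, nbytes):
        return bytearray(nbytes)

    def deallocate(self, buf, nbytes):
        del buf


mr = TrackingAdaptorSketch(HostMemorySketch())
buf = mr.allocate(1024)
assert mr.get_allocated_bytes() == 1024
mr.deallocate(buf, 1024)
assert mr.get_allocated_bytes() == 0
```

Because the adaptor only forwards to its upstream resource, the worker's allocation behavior (pool, managed memory, etc.) is unchanged; tracking is purely additive.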