Add benchmark for da.map_overlap #399
Conversation
Running this on a DGX-1 provides the following results:

Using CuPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | tcp
Device(s) | 0
==========================
Wall-clock | npartitions
--------------------------
3.84 s | 16
2.50 s | 16
2.74 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------

Using NumPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | tcp
Device(s) | 0
==========================
Wall-clock | npartitions
--------------------------
4.36 s | 16
4.37 s | 16
4.34 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------
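For reference, going by the options used later in this thread, these single-device TCP runs presumably correspond to an invocation along the lines of the following (the exact flags for this run are an assumption):

python dask_cuda/benchmarks/local_cupy_map_overlap.py -d 0 -p tcp -t <gpu/cpu>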
Force-pushed from 5e8871a to ba38f68.
Force-pushed from ba38f68 to 537f551.
Codecov Report

@@             Coverage Diff              @@
##           branch-0.16     #399     +/-  ##
===============================================
- Coverage        59.74%   56.39%   -3.36%
===============================================
  Files               17       18       +1
  Lines             1329     1431     +102
===============================================
+ Hits               794      807      +13
- Misses             535      624      +89

Continue to review the full report at Codecov.
I don't see why not, but could you also check that CuPy with UCX works, @jakirkham?
Also, you're only using one device in the tests above; we only see UCX benefits when we have multiple devices.
Ah ok. Thanks for pointing that out. Tried by running the following:

python dask_cuda/benchmarks/local_cupy_map_overlap.py -d 0,1,2,3,4,5,6,7 -p ucx --enable-nvlink -t <gpu/cpu>

Should add that it doesn't appear we are specifying a pool size when using the RMM pool. Maybe that will cause us some performance issues.

Using CuPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
==========================
Wall-clock | npartitions
--------------------------
11.38 s | 16
5.13 s | 16
5.21 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------
(00,04) | 8.66 MB/s 8.66 MB/s 8.66 MB/s (50.00 MB)
(01,05) | 2.42 GB/s 2.42 GB/s 2.42 GB/s (50.00 MB)
(02,04) | 407.41 MB/s 407.41 MB/s 407.41 MB/s (50.00 MB)
(02,05) | 137.12 MB/s 137.12 MB/s 137.12 MB/s (50.00 MB)
(03,06) | 175.29 MB/s 175.29 MB/s 175.29 MB/s (50.00 MB)
(03,07) | 1.30 GB/s 1.30 GB/s 1.30 GB/s (50.00 MB)
(04,00) | 407.91 MB/s 407.91 MB/s 407.91 MB/s (50.00 MB)
(05,00) | 98.13 MB/s 127.34 MB/s 156.54 MB/s (100.00 MB)
(05,01) | 187.39 MB/s 214.38 MB/s 241.37 MB/s (100.00 MB)
(05,02) | 95.70 MB/s 107.38 MB/s 119.07 MB/s (100.00 MB)
(05,03) | 83.18 MB/s 105.36 MB/s 205.68 MB/s (150.00 MB)
(05,04) | 139.25 MB/s 182.85 MB/s 226.44 MB/s (100.00 MB)
(05,06) | 84.34 MB/s 158.50 MB/s 213.36 MB/s (150.00 MB)
(05,07) | 139.66 MB/s 179.33 MB/s 219.00 MB/s (100.00 MB)
(06,00) | 87.86 MB/s 91.71 MB/s 106.00 MB/s (150.00 MB)
(06,01) | 88.56 MB/s 100.07 MB/s 138.79 MB/s (200.00 MB)
(06,02) | 235.20 MB/s 256.63 MB/s 316.94 MB/s (300.00 MB)
(06,03) | 74.63 MB/s 130.78 MB/s 242.48 MB/s (200.00 MB)
(06,04) | 102.60 MB/s 112.97 MB/s 144.62 MB/s (150.00 MB)
(06,05) | 129.47 MB/s 129.74 MB/s 138.84 MB/s (150.00 MB)
(06,07) | 141.32 MB/s 193.52 MB/s 307.85 MB/s (300.00 MB)
(07,00) | 9.11 MB/s 9.11 MB/s 9.11 MB/s (50.00 MB)
(07,01) | 359.07 MB/s 359.07 MB/s 359.07 MB/s (50.00 MB)

Using NumPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
==========================
Wall-clock | npartitions
--------------------------
2.43 s | 16
2.03 s | 16
2.03 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------
(00,04) | 1.46 GB/s 1.46 GB/s 1.46 GB/s (50.00 MB)
(00,06) | 1.59 GB/s 1.59 GB/s 1.59 GB/s (50.08 MB)
(00,07) | 1.60 GB/s 1.60 GB/s 1.60 GB/s (50.00 MB)
(01,00) | 532.86 MB/s 698.01 MB/s 1.03 GB/s (200.08 MB)
(01,02) | 459.79 MB/s 498.68 MB/s 537.57 MB/s (100.00 MB)
(01,03) | 451.52 MB/s 490.99 MB/s 530.46 MB/s (100.00 MB)
(01,04) | 496.55 MB/s 565.75 MB/s 951.16 MB/s (150.00 MB)
(01,05) | 426.74 MB/s 426.74 MB/s 426.74 MB/s (50.00 MB)
(01,06) | 452.31 MB/s 483.82 MB/s 515.32 MB/s (100.00 MB)
(01,07) | 459.77 MB/s 498.58 MB/s 537.38 MB/s (100.00 MB)
(02,03) | 1.69 GB/s 1.69 GB/s 1.69 GB/s (50.00 MB)
(02,06) | 1.33 GB/s 1.33 GB/s 1.33 GB/s (50.00 MB)
(02,07) | 760.93 MB/s 760.93 MB/s 760.93 MB/s (50.00 MB)
(03,00) | 433.91 MB/s 436.13 MB/s 438.35 MB/s (100.00 MB)
(03,01) | 425.59 MB/s 431.46 MB/s 437.34 MB/s (100.00 MB)
(03,02) | 436.09 MB/s 438.24 MB/s 440.38 MB/s (100.00 MB)
(03,04) | 435.72 MB/s 437.72 MB/s 439.71 MB/s (100.00 MB)
(03,05) | 436.94 MB/s 438.91 MB/s 440.87 MB/s (100.00 MB)
(03,06) | 440.83 MB/s 747.13 MB/s 1.33 GB/s (200.00 MB)
(03,07) | 436.54 MB/s 439.03 MB/s 441.52 MB/s (100.00 MB)
(04,00) | 435.58 MB/s 435.96 MB/s 436.35 MB/s (100.00 MB)
(04,01) | 424.26 MB/s 431.20 MB/s 438.15 MB/s (100.00 MB)
(04,02) | 435.56 MB/s 436.17 MB/s 436.79 MB/s (100.00 MB)
(04,03) | 436.64 MB/s 436.77 MB/s 436.91 MB/s (100.00 MB)
(04,05) | 434.97 MB/s 437.67 MB/s 652.00 MB/s (150.00 MB)
(04,06) | 436.65 MB/s 437.37 MB/s 607.19 MB/s (150.00 MB)
(04,07) | 435.82 MB/s 436.90 MB/s 437.98 MB/s (100.00 MB)
(05,06) | 1.15 GB/s 1.19 GB/s 1.22 GB/s (100.00 MB)
(05,07) | 1.70 GB/s 1.70 GB/s 1.70 GB/s (50.00 MB)
(06,02) | 2.03 GB/s 2.03 GB/s 2.03 GB/s (50.00 MB)
(06,05) | 2.17 GB/s 2.17 GB/s 2.17 GB/s (50.00 MB)
(06,07) | 1.17 GB/s 1.17 GB/s 1.17 GB/s (50.00 MB)
(07,01) | 2.00 GB/s 2.00 GB/s 2.00 GB/s (50.00 MB)
Performance seems very bad; I guess we still have some work to do for this kind of benchmark.
Yeah, we're probably using the 50% default. That could be a problem in the first run, but since the second run onwards should already have a pool set up (grown in the event it needed to be increased), I think the numbers will be consistent. Perhaps the ideal would be to let the user specify the pool size.
LGTM, thanks @jakirkham!
Thanks, Peter! 😄 Yeah, I'm wondering if we need a larger overlap to see a benefit. Playing with that locally to see if it helps.

Ah ok. Good to know. I can take a look at making this configurable.
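A minimal sketch of what making that configurable could look like, assuming the benchmark builds its own dask_cuda LocalCUDACluster and that a --rmm-pool-size flag is added; the flag name and plumbing here are hypothetical, not the final implementation:

import argparse

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

parser = argparse.ArgumentParser()
# Hypothetical option; the actual benchmark flag may end up differing.
parser.add_argument(
    "--rmm-pool-size",
    default=None,
    help="RMM pool size per worker, e.g. '24GB'",
)
args = parser.parse_args()

# When set, dask-cuda forwards rmm_pool_size to each worker, which then
# creates an RMM memory pool of that size before any tasks run.
cluster = LocalCUDACluster(rmm_pool_size=args.rmm_pool_size)
client = Client(cluster)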
Includes a benchmark for da.map_overlap on arrays. This is another case where communication is important, due to the need to exchange information from neighboring chunks as part of each computation. Should provide us a way to track progress on this front.

cc @pentschev @GenevieveBuckley @quasiben
xref: dask/dask#4803
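For context, a minimal sketch of the kind of workload this benchmark exercises: a CuPy-backed Dask array with a shape-preserving stencil applied via map_overlap, so each chunk needs a one-element halo from its neighbors. The array size, chunking, and stencil below are illustrative, not the benchmark's exact parameters:

import cupy
import dask.array as da


def smooth(block):
    # 3-point moving average along the first axis; shape-preserving,
    # but the edge rows depend on values from neighboring chunks.
    out = block.copy()
    out[1:-1] = (block[:-2] + block[1:-1] + block[2:]) / 3
    return out


# NumPy-backed random array, moved chunk-by-chunk onto the GPU.
x = da.random.random((10000, 10000), chunks=(2500, 2500))
x = x.map_blocks(cupy.asarray)

# depth=1 adds a one-element overlap on every axis, which is what forces
# the inter-worker transfers the benchmark measures.
y = x.map_overlap(smooth, depth=1, boundary="reflect")
y.sum().compute()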