
Add benchmark for da.map_overlap #399

Merged
1 commit merged into rapidsai:branch-0.16 on Sep 14, 2020

Conversation

jakirkham (Member)

Includes a benchmark for da.map_overlap on arrays. This is another case where communication is important, since information from neighboring chunks must be exchanged as part of each computation. It should give us a way to track progress on this front.

cc @pentschev @GenevieveBuckley @quasiben

xref: dask/dask#4803
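
For context, the kind of workload this benchmark exercises looks roughly like the sketch below (illustrative only; the function and sizes are assumptions, not the benchmark's actual kernel). map_overlap first exchanges a halo of depth elements between neighboring chunks, which is where the inter-worker communication comes from.

# Minimal map_overlap sketch (illustrative; not the benchmark's actual kernel).
import dask.array as da
import numpy as np

x = da.random.random((10000, 10000), chunks=(2500, 2500))  # 16 chunks

def shifted_sum(block):
    # Reads values near chunk edges, so it relies on the halo from neighboring chunks.
    return block + np.roll(block, 1, axis=0) + np.roll(block, 1, axis=1)

y = x.map_overlap(shifted_sum, depth=1, boundary="reflect")
print(y.sum().compute())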

@jakirkham jakirkham requested a review from a team as a code owner September 14, 2020 19:43
jakirkham (Member, Author)

Running this on a DGX-1 provides the following results:

Using CuPy:
Roundtrip benchmark
--------------------------
Size        | 10000*10000
Chunk-size  | 128 MiB
Ignore-size | 1.05 MB
Protocol    | tcp
Device(s)   | 0
==========================
Wall-clock  | npartitions
--------------------------
3.84 s      | 16
2.50 s      | 16
2.74 s      | 16
==========================
(w1,w2)     | 25% 50% 75% (total nbytes)
--------------------------
Using NumPy:
Roundtrip benchmark
--------------------------
Size        | 10000*10000
Chunk-size  | 128 MiB
Ignore-size | 1.05 MB
Protocol    | tcp
Device(s)   | 0
==========================
Wall-clock  | npartitions
--------------------------
4.36 s      | 16
4.37 s      | 16
4.34 s      | 16
==========================
(w1,w2)     | 25% 50% 75% (total nbytes)
--------------------------

@jakirkham jakirkham force-pushed the add_map_overlap_bench branch 2 times, most recently from 5e8871a to ba38f68 on September 14, 2020 19:57
codecov-commenter commented Sep 14, 2020

Codecov Report

Merging #399 into branch-0.16 will decrease coverage by 3.35%.
The diff coverage is 0.00%.


@@               Coverage Diff               @@
##           branch-0.16     #399      +/-   ##
===============================================
- Coverage        59.74%   56.39%   -3.36%     
===============================================
  Files               17       18       +1     
  Lines             1329     1431     +102     
===============================================
+ Hits               794      807      +13     
- Misses             535      624      +89     
Impacted Files | Coverage Δ
dask_cuda/benchmarks/local_cupy_map_overlap.py 0.00% <0.00%> (ø)
dask_cuda/device_host_file.py 98.64% <0.00%> (+0.03%) ⬆️
dask_cuda/cli/dask_cuda_worker.py 96.77% <0.00%> (+0.05%) ⬆️
dask_cuda/initialize.py 92.59% <0.00%> (+0.28%) ⬆️
dask_cuda/_version.py 44.80% <0.00%> (+0.39%) ⬆️
dask_cuda/is_device_object.py 88.88% <0.00%> (+3.88%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2fa52fe...537f551.

pentschev (Member)

I don't see why not, but could you also check that CuPy with UCX works, @jakirkham?

pentschev (Member)

Also, you're only using one device in the tests above; we only see UCX benefits when we have -d 0,1,2,3,4,5,6,7, for example.
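
For reference, those flags correspond roughly to the cluster setup sketched below (illustrative; the benchmark script builds its own cluster internally, and the parameter names are the dask_cuda.LocalCUDACluster ones):

# Rough equivalent of `-d 0,1,2,3,4,5,6,7 -p ucx --enable-nvlink` when building
# a cluster by hand (illustrative sketch, not the benchmark's own setup code).
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7",  # one worker per GPU
    protocol="ucx",                          # UCX transport instead of TCP
    enable_nvlink=True,                      # allow NVLink for GPU-GPU transfers
)
client = Client(cluster)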

jakirkham (Member, Author) commented Sep 14, 2020

Ah ok. Thanks for pointing that out. Tried it by running the following:

python dask_cuda/benchmarks/local_cupy_map_overlap.py -d 0,1,2,3,4,5,6,7 -p ucx --enable-nvlink -t <gpu/cpu>

Should add that it doesn't appear we're specifying a pool size when using the RMM pool. Maybe that will cause us some performance issues.

Using CuPy:
Roundtrip benchmark
--------------------------
Size        | 10000*10000
Chunk-size  | 128 MiB
Ignore-size | 1.05 MB
Protocol    | ucx
Device(s)   | 0,1,2,3,4,5,6,7
==========================
Wall-clock  | npartitions
--------------------------
11.38 s     | 16
5.13 s      | 16
5.21 s      | 16
==========================
(w1,w2)     | 25% 50% 75% (total nbytes)
--------------------------
(00,04)     | 8.66 MB/s 8.66 MB/s 8.66 MB/s (50.00 MB)
(01,05)     | 2.42 GB/s 2.42 GB/s 2.42 GB/s (50.00 MB)
(02,04)     | 407.41 MB/s 407.41 MB/s 407.41 MB/s (50.00 MB)
(02,05)     | 137.12 MB/s 137.12 MB/s 137.12 MB/s (50.00 MB)
(03,06)     | 175.29 MB/s 175.29 MB/s 175.29 MB/s (50.00 MB)
(03,07)     | 1.30 GB/s 1.30 GB/s 1.30 GB/s (50.00 MB)
(04,00)     | 407.91 MB/s 407.91 MB/s 407.91 MB/s (50.00 MB)
(05,00)     | 98.13 MB/s 127.34 MB/s 156.54 MB/s (100.00 MB)
(05,01)     | 187.39 MB/s 214.38 MB/s 241.37 MB/s (100.00 MB)
(05,02)     | 95.70 MB/s 107.38 MB/s 119.07 MB/s (100.00 MB)
(05,03)     | 83.18 MB/s 105.36 MB/s 205.68 MB/s (150.00 MB)
(05,04)     | 139.25 MB/s 182.85 MB/s 226.44 MB/s (100.00 MB)
(05,06)     | 84.34 MB/s 158.50 MB/s 213.36 MB/s (150.00 MB)
(05,07)     | 139.66 MB/s 179.33 MB/s 219.00 MB/s (100.00 MB)
(06,00)     | 87.86 MB/s 91.71 MB/s 106.00 MB/s (150.00 MB)
(06,01)     | 88.56 MB/s 100.07 MB/s 138.79 MB/s (200.00 MB)
(06,02)     | 235.20 MB/s 256.63 MB/s 316.94 MB/s (300.00 MB)
(06,03)     | 74.63 MB/s 130.78 MB/s 242.48 MB/s (200.00 MB)
(06,04)     | 102.60 MB/s 112.97 MB/s 144.62 MB/s (150.00 MB)
(06,05)     | 129.47 MB/s 129.74 MB/s 138.84 MB/s (150.00 MB)
(06,07)     | 141.32 MB/s 193.52 MB/s 307.85 MB/s (300.00 MB)
(07,00)     | 9.11 MB/s 9.11 MB/s 9.11 MB/s (50.00 MB)
(07,01)     | 359.07 MB/s 359.07 MB/s 359.07 MB/s (50.00 MB)
Using NumPy:
Roundtrip benchmark
--------------------------
Size        | 10000*10000
Chunk-size  | 128 MiB
Ignore-size | 1.05 MB
Protocol    | ucx
Device(s)   | 0,1,2,3,4,5,6,7
==========================
Wall-clock  | npartitions
--------------------------
2.43 s      | 16
2.03 s      | 16
2.03 s      | 16
==========================
(w1,w2)     | 25% 50% 75% (total nbytes)
--------------------------
(00,04)     | 1.46 GB/s 1.46 GB/s 1.46 GB/s (50.00 MB)
(00,06)     | 1.59 GB/s 1.59 GB/s 1.59 GB/s (50.08 MB)
(00,07)     | 1.60 GB/s 1.60 GB/s 1.60 GB/s (50.00 MB)
(01,00)     | 532.86 MB/s 698.01 MB/s 1.03 GB/s (200.08 MB)
(01,02)     | 459.79 MB/s 498.68 MB/s 537.57 MB/s (100.00 MB)
(01,03)     | 451.52 MB/s 490.99 MB/s 530.46 MB/s (100.00 MB)
(01,04)     | 496.55 MB/s 565.75 MB/s 951.16 MB/s (150.00 MB)
(01,05)     | 426.74 MB/s 426.74 MB/s 426.74 MB/s (50.00 MB)
(01,06)     | 452.31 MB/s 483.82 MB/s 515.32 MB/s (100.00 MB)
(01,07)     | 459.77 MB/s 498.58 MB/s 537.38 MB/s (100.00 MB)
(02,03)     | 1.69 GB/s 1.69 GB/s 1.69 GB/s (50.00 MB)
(02,06)     | 1.33 GB/s 1.33 GB/s 1.33 GB/s (50.00 MB)
(02,07)     | 760.93 MB/s 760.93 MB/s 760.93 MB/s (50.00 MB)
(03,00)     | 433.91 MB/s 436.13 MB/s 438.35 MB/s (100.00 MB)
(03,01)     | 425.59 MB/s 431.46 MB/s 437.34 MB/s (100.00 MB)
(03,02)     | 436.09 MB/s 438.24 MB/s 440.38 MB/s (100.00 MB)
(03,04)     | 435.72 MB/s 437.72 MB/s 439.71 MB/s (100.00 MB)
(03,05)     | 436.94 MB/s 438.91 MB/s 440.87 MB/s (100.00 MB)
(03,06)     | 440.83 MB/s 747.13 MB/s 1.33 GB/s (200.00 MB)
(03,07)     | 436.54 MB/s 439.03 MB/s 441.52 MB/s (100.00 MB)
(04,00)     | 435.58 MB/s 435.96 MB/s 436.35 MB/s (100.00 MB)
(04,01)     | 424.26 MB/s 431.20 MB/s 438.15 MB/s (100.00 MB)
(04,02)     | 435.56 MB/s 436.17 MB/s 436.79 MB/s (100.00 MB)
(04,03)     | 436.64 MB/s 436.77 MB/s 436.91 MB/s (100.00 MB)
(04,05)     | 434.97 MB/s 437.67 MB/s 652.00 MB/s (150.00 MB)
(04,06)     | 436.65 MB/s 437.37 MB/s 607.19 MB/s (150.00 MB)
(04,07)     | 435.82 MB/s 436.90 MB/s 437.98 MB/s (100.00 MB)
(05,06)     | 1.15 GB/s 1.19 GB/s 1.22 GB/s (100.00 MB)
(05,07)     | 1.70 GB/s 1.70 GB/s 1.70 GB/s (50.00 MB)
(06,02)     | 2.03 GB/s 2.03 GB/s 2.03 GB/s (50.00 MB)
(06,05)     | 2.17 GB/s 2.17 GB/s 2.17 GB/s (50.00 MB)
(06,07)     | 1.17 GB/s 1.17 GB/s 1.17 GB/s (50.00 MB)
(07,01)     | 2.00 GB/s 2.00 GB/s 2.00 GB/s (50.00 MB)

pentschev (Member)

Performance seems very bad; I guess we still have some work to do for this kind of benchmark.

Should add that it doesn't appear we're specifying a pool size when using the RMM pool. Maybe that will cause us some performance issues.

Yeah, we're probably using the 50% default. That could be a problem in the first run, but since the second run onwards should already have a pool set up (in the event it needs to be increased), I think the numbers will be consistent. Ideally we would let the user specify the pool size.
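
For reference, a fixed pool size can be requested when building a cluster by hand; a minimal sketch assuming LocalCUDACluster's rmm_pool_size option (the 24GB figure is only an illustration):

# Sketch of specifying an explicit per-worker RMM pool size (illustrative;
# the benchmark did not expose this option at the time of this PR).
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7",
    protocol="ucx",
    enable_nvlink=True,
    rmm_pool_size="24GB",  # pre-allocate a 24 GB RMM pool on each worker
)
client = Client(cluster)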

pentschev (Member) left a comment


LGTM, thanks @jakirkham!

@pentschev pentschev merged commit 8d42f27 into rapidsai:branch-0.16 Sep 14, 2020
@jakirkham jakirkham deleted the add_map_overlap_bench branch September 14, 2020 22:27
jakirkham (Member, Author)

Thanks Peter! 😄

Yeah I'm wondering if we need a larger overlap to see a benefit. Playing with that locally to see if that helps.

Ah ok. Good to know. I can take a look at making this configurable.
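
On the larger-overlap point above, some rough arithmetic (illustrative; assumes the 2500*2500 float64 chunks implied by the 50.00 MB transfers in the results):

# Halo size per shared chunk edge as a function of overlap depth
# (back-of-the-envelope only, not taken from the benchmark itself).
chunk_edge = 2500  # elements along one side of a chunk
itemsize = 8       # bytes per float64 element

for depth in (1, 10, 100):
    halo_bytes = chunk_edge * depth * itemsize
    print(f"depth={depth:3d}: ~{halo_bytes / 1e6:.2f} MB per shared edge")

Even at depth=100 the halo is only about 2 MB per edge, small compared to the roughly 50 MB chunks.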
