Add benchmark for da.map_overlap #399
Conversation
Running this on a DGX-1 provides the following results:

Using CuPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | tcp
Device(s) | 0
==========================
Wall-clock | npartitions
--------------------------
3.84 s | 16
2.50 s | 16
2.74 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------

Using NumPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | tcp
Device(s) | 0
==========================
Wall-clock | npartitions
--------------------------
4.36 s | 16
4.37 s | 16
4.34 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------
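For reference, going by the options used later in this thread, these single-device TCP runs presumably correspond to an invocation along the lines of the following (the exact flags for this run are an assumption):

python dask_cuda/benchmarks/local_cupy_map_overlap.py -d 0 -p tcp -t <gpu/cpu>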
Force-pushed from 5e8871a to ba38f68.
Force-pushed from ba38f68 to 537f551.
Codecov Report

@@             Coverage Diff              @@
##           branch-0.16     #399     +/-  ##
===============================================
- Coverage        59.74%   56.39%   -3.36%
===============================================
  Files               17       18       +1
  Lines             1329     1431     +102
===============================================
+ Hits               794      807      +13
- Misses             535      624      +89

Continue to review the full report at Codecov.
I don't see why not, but could you also check that CuPy with UCX works, @jakirkham?
Also, you're only using one device in the tests above; we only see UCX benefits when we have multiple devices.
Ah ok. Thanks for pointing that out. Tried by running the following:

python dask_cuda/benchmarks/local_cupy_map_overlap.py -d 0,1,2,3,4,5,6,7 -p ucx --enable-nvlink -t <gpu/cpu>

Should add that it doesn't appear we are specifying a pool size when using the RMM pool. Maybe that will cause us some performance issues.

Using CuPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
==========================
Wall-clock | npartitions
--------------------------
11.38 s | 16
5.13 s | 16
5.21 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------
(00,04) | 8.66 MB/s 8.66 MB/s 8.66 MB/s (50.00 MB)
(01,05) | 2.42 GB/s 2.42 GB/s 2.42 GB/s (50.00 MB)
(02,04) | 407.41 MB/s 407.41 MB/s 407.41 MB/s (50.00 MB)
(02,05) | 137.12 MB/s 137.12 MB/s 137.12 MB/s (50.00 MB)
(03,06) | 175.29 MB/s 175.29 MB/s 175.29 MB/s (50.00 MB)
(03,07) | 1.30 GB/s 1.30 GB/s 1.30 GB/s (50.00 MB)
(04,00) | 407.91 MB/s 407.91 MB/s 407.91 MB/s (50.00 MB)
(05,00) | 98.13 MB/s 127.34 MB/s 156.54 MB/s (100.00 MB)
(05,01) | 187.39 MB/s 214.38 MB/s 241.37 MB/s (100.00 MB)
(05,02) | 95.70 MB/s 107.38 MB/s 119.07 MB/s (100.00 MB)
(05,03) | 83.18 MB/s 105.36 MB/s 205.68 MB/s (150.00 MB)
(05,04) | 139.25 MB/s 182.85 MB/s 226.44 MB/s (100.00 MB)
(05,06) | 84.34 MB/s 158.50 MB/s 213.36 MB/s (150.00 MB)
(05,07) | 139.66 MB/s 179.33 MB/s 219.00 MB/s (100.00 MB)
(06,00) | 87.86 MB/s 91.71 MB/s 106.00 MB/s (150.00 MB)
(06,01) | 88.56 MB/s 100.07 MB/s 138.79 MB/s (200.00 MB)
(06,02) | 235.20 MB/s 256.63 MB/s 316.94 MB/s (300.00 MB)
(06,03) | 74.63 MB/s 130.78 MB/s 242.48 MB/s (200.00 MB)
(06,04) | 102.60 MB/s 112.97 MB/s 144.62 MB/s (150.00 MB)
(06,05) | 129.47 MB/s 129.74 MB/s 138.84 MB/s (150.00 MB)
(06,07) | 141.32 MB/s 193.52 MB/s 307.85 MB/s (300.00 MB)
(07,00) | 9.11 MB/s 9.11 MB/s 9.11 MB/s (50.00 MB)
(07,01) | 359.07 MB/s 359.07 MB/s 359.07 MB/s (50.00 MB)

Using NumPy:

Roundtrip benchmark
--------------------------
Size | 10000*10000
Chunk-size | 128 MiB
Ignore-size | 1.05 MB
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
==========================
Wall-clock | npartitions
--------------------------
2.43 s | 16
2.03 s | 16
2.03 s | 16
==========================
(w1,w2) | 25% 50% 75% (total nbytes)
--------------------------
(00,04) | 1.46 GB/s 1.46 GB/s 1.46 GB/s (50.00 MB)
(00,06) | 1.59 GB/s 1.59 GB/s 1.59 GB/s (50.08 MB)
(00,07) | 1.60 GB/s 1.60 GB/s 1.60 GB/s (50.00 MB)
(01,00) | 532.86 MB/s 698.01 MB/s 1.03 GB/s (200.08 MB)
(01,02) | 459.79 MB/s 498.68 MB/s 537.57 MB/s (100.00 MB)
(01,03) | 451.52 MB/s 490.99 MB/s 530.46 MB/s (100.00 MB)
(01,04) | 496.55 MB/s 565.75 MB/s 951.16 MB/s (150.00 MB)
(01,05) | 426.74 MB/s 426.74 MB/s 426.74 MB/s (50.00 MB)
(01,06) | 452.31 MB/s 483.82 MB/s 515.32 MB/s (100.00 MB)
(01,07) | 459.77 MB/s 498.58 MB/s 537.38 MB/s (100.00 MB)
(02,03) | 1.69 GB/s 1.69 GB/s 1.69 GB/s (50.00 MB)
(02,06) | 1.33 GB/s 1.33 GB/s 1.33 GB/s (50.00 MB)
(02,07) | 760.93 MB/s 760.93 MB/s 760.93 MB/s (50.00 MB)
(03,00) | 433.91 MB/s 436.13 MB/s 438.35 MB/s (100.00 MB)
(03,01) | 425.59 MB/s 431.46 MB/s 437.34 MB/s (100.00 MB)
(03,02) | 436.09 MB/s 438.24 MB/s 440.38 MB/s (100.00 MB)
(03,04) | 435.72 MB/s 437.72 MB/s 439.71 MB/s (100.00 MB)
(03,05) | 436.94 MB/s 438.91 MB/s 440.87 MB/s (100.00 MB)
(03,06) | 440.83 MB/s 747.13 MB/s 1.33 GB/s (200.00 MB)
(03,07) | 436.54 MB/s 439.03 MB/s 441.52 MB/s (100.00 MB)
(04,00) | 435.58 MB/s 435.96 MB/s 436.35 MB/s (100.00 MB)
(04,01) | 424.26 MB/s 431.20 MB/s 438.15 MB/s (100.00 MB)
(04,02) | 435.56 MB/s 436.17 MB/s 436.79 MB/s (100.00 MB)
(04,03) | 436.64 MB/s 436.77 MB/s 436.91 MB/s (100.00 MB)
(04,05) | 434.97 MB/s 437.67 MB/s 652.00 MB/s (150.00 MB)
(04,06) | 436.65 MB/s 437.37 MB/s 607.19 MB/s (150.00 MB)
(04,07) | 435.82 MB/s 436.90 MB/s 437.98 MB/s (100.00 MB)
(05,06) | 1.15 GB/s 1.19 GB/s 1.22 GB/s (100.00 MB)
(05,07) | 1.70 GB/s 1.70 GB/s 1.70 GB/s (50.00 MB)
(06,02) | 2.03 GB/s 2.03 GB/s 2.03 GB/s (50.00 MB)
(06,05) | 2.17 GB/s 2.17 GB/s 2.17 GB/s (50.00 MB)
(06,07) | 1.17 GB/s 1.17 GB/s 1.17 GB/s (50.00 MB)
(07,01) | 2.00 GB/s 2.00 GB/s 2.00 GB/s (50.00 MB)
Performance seems very bad; I guess we still have some work to do for this kind of benchmark.
Yeah, we're probably using the 50% default. That could be a problem in the first run, but since the second run onwards should already have a pool set up (grown in the event it needed to be increased), I think the numbers will be consistent. Perhaps the ideal would be to let the user specify the pool size.
LGTM, thanks @jakirkham!
Thanks, Peter! 😄 Yeah, I'm wondering if we need a larger overlap to see a benefit. Playing with that locally to see if it helps.

Ah ok. Good to know. I can take a look at making this configurable.
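A minimal sketch of what making that configurable could look like, assuming the benchmark builds its own dask_cuda LocalCUDACluster and that a --rmm-pool-size flag is added; the flag name and plumbing here are hypothetical, not the final implementation:

import argparse

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

parser = argparse.ArgumentParser()
# Hypothetical option; the actual benchmark flag may end up differing.
parser.add_argument(
    "--rmm-pool-size",
    default=None,
    help="RMM pool size per worker, e.g. '24GB'",
)
args = parser.parse_args()

# When set, dask-cuda forwards rmm_pool_size to each worker, which then
# creates an RMM memory pool of that size before any tasks run.
cluster = LocalCUDACluster(rmm_pool_size=args.rmm_pool_size)
client = Client(cluster)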
Includes a benchmark for da.map_overlap on arrays. This is another case where communication is important, due to the need to exchange information from neighboring chunks as part of each computation. Should provide us a way to track progress on this front.

cc @pentschev @GenevieveBuckley @quasiben
xref: dask/dask#4803
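For context, a minimal sketch of the kind of workload this benchmark exercises: a CuPy-backed Dask array with a shape-preserving stencil applied via map_overlap, so each chunk needs a one-element halo from its neighbors. The array size, chunking, and stencil below are illustrative, not the benchmark's exact parameters:

import cupy
import dask.array as da


def smooth(block):
    # 3-point moving average along the first axis; shape-preserving,
    # but the edge rows depend on values from neighboring chunks.
    out = block.copy()
    out[1:-1] = (block[:-2] + block[1:-1] + block[2:]) / 3
    return out


# NumPy-backed random array, moved chunk-by-chunk onto the GPU.
x = da.random.random((10000, 10000), chunks=(2500, 2500))
x = x.map_blocks(cupy.asarray)

# depth=1 adds a one-element overlap on every axis, which is what forces
# the inter-worker transfers the benchmark measures.
y = x.map_overlap(smooth, depth=1, boundary="reflect")
y.sum().compute()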