Performance regression in cuDF merge benchmark #935

Open · pentschev opened this issue Jun 16, 2022 · 7 comments

@pentschev (Member) commented Jun 16, 2022

Running the cuDF benchmark with RAPIDS 22.06 results in the following:

RAPIDS 22.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
20.70 s        | 294.80 MiB/s
17.62 s        | 346.49 MiB/s
39.32 s        | 155.22 MiB/s
================================================================================
Throughput     | 265.50 MiB +/- 80.79 MiB
Wall-Clock     | 25.88 s +/- 9.59 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01)        | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)

If we roll back one year to RAPIDS 21.06, performance was substantially better:

RAPIDS 21.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
15.40 s        | 396.40 MiB/s
7.35 s         | 830.55 MiB/s
8.80 s         | 693.83 MiB/s
===============================
(w1,w2)     | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)     | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01)     | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)

It isn't clear where this regression comes from; potential candidates include Distributed, cuDF, and Dask-CUDA itself.

@pentschev (Member, Author)

RAPIDS 21.12 and 22.02 perform better than 21.06. The regression first appeared in 22.04; see the results below.
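
For completeness, a rough sketch of how these per-release runs could be scripted. The conda environment names are hypothetical placeholders (one environment per RAPIDS release); `conda run` simply executes the benchmark with the given environment active:

```python
# Hypothetical helper to repeat the benchmark across conda environments,
# one per RAPIDS release; adjust the environment names to your local setup.
import subprocess

ENVS = ["rapids-21.06", "rapids-21.12", "rapids-22.02", "rapids-22.04"]
CMD = [
    "python", "dask_cuda/benchmarks/local_cudf_merge.py",
    "-d", "1,2", "-c", "100_000_000", "--runs", "10",
]

for env in ENVS:
    # Run the benchmark inside the environment and capture its report.
    result = subprocess.run(
        ["conda", "run", "-n", env] + CMD,
        capture_output=True, text=True,
    )
    print(f"===== {env} =====")
    print(result.stdout)
```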

RAPIDS 21.06 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
8.28 s         | 737.58 MiB/s
15.94 s        | 382.80 MiB/s
16.27 s        | 375.17 MiB/s
15.90 s        | 383.76 MiB/s
15.54 s        | 392.86 MiB/s
15.67 s        | 389.50 MiB/s
15.52 s        | 393.30 MiB/s
16.02 s        | 381.04 MiB/s
7.72 s         | 790.71 MiB/s
8.35 s         | 730.57 MiB/s
===============================
(w1,w2)     | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)     | 291.99 MiB/s 317.62 MiB/s 402.70 MiB/s (51.04 GiB)
(02,01)     | 295.38 MiB/s 327.51 MiB/s 401.62 MiB/s (51.03 GiB)
RAPIDS 21.12 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
5.91 s         | 1.01 GiB/s
5.72 s         | 1.04 GiB/s
11.16 s        | 546.74 MiB/s
4.82 s         | 1.24 GiB/s
4.87 s         | 1.22 GiB/s
4.83 s         | 1.23 GiB/s
5.72 s         | 1.04 GiB/s
5.78 s         | 1.03 GiB/s
5.76 s         | 1.03 GiB/s
11.18 s        | 546.06 MiB/s
===============================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 429.34 MiB/s 509.66 MiB/s 626.34 MiB/s (39.86 GiB)
(02,01)        | 419.16 MiB/s 502.99 MiB/s 633.16 MiB/s (39.86 GiB)
RAPIDS 22.02 cuDF benchmark - 10 iterations
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
4.79 s         | 1.24 GiB/s
10.99 s        | 555.41 MiB/s
10.05 s        | 607.36 MiB/s
10.29 s        | 593.14 MiB/s
9.94 s         | 614.12 MiB/s
10.37 s        | 588.66 MiB/s
4.78 s         | 1.25 GiB/s
5.71 s         | 1.04 GiB/s
10.13 s        | 602.58 MiB/s
4.69 s         | 1.27 GiB/s
================================================================================
Throughput     | 848.05 MiB +/- 317.55 MiB
Wall-Clock     | 8.17 s +/- 2.62 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 428.54 MiB/s 478.27 MiB/s 562.20 MiB/s (48.80 GiB)
(02,01)        | 440.11 MiB/s 513.94 MiB/s 562.02 MiB/s (48.80 GiB)
RAPIDS 22.04 cuDF benchmark - 10 iterations
$ python local_cudf_merge.py -d 1,2 -c 100_000_000 --runs 10
2022-06-20 02:01:12,323 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-20 02:01:12,325 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
49.11 s        | 124.29 MiB/s
48.77 s        | 125.15 MiB/s
45.11 s        | 135.31 MiB/s
44.94 s        | 135.81 MiB/s
44.13 s        | 138.30 MiB/s
40.67 s        | 150.07 MiB/s
48.73 s        | 125.25 MiB/s
15.46 s        | 394.67 MiB/s
48.26 s        | 126.47 MiB/s
44.82 s        | 136.17 MiB/s
================================================================================
Throughput     | 159.15 MiB +/- 78.88 MiB
Wall-Clock     | 43.00 s +/- 9.53 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 94.22 MiB/s 144.61 MiB/s 176.84 MiB/s (55.51 GiB)
(02,01)        | 108.26 MiB/s 126.38 MiB/s 144.91 MiB/s (55.51 GiB)

@pentschev (Member, Author)

The reason for this behavior is compression. Dask 2022.3.0 (RAPIDS 22.04) depends on lz4, whereas Dask 2022.1.0 (RAPIDS 22.02) doesn't.

Distributed defaults to distributed.comm.compression=auto, which picks lz4 whenever it is available. Disabling compression entirely yields significantly better bandwidth (~5x) and a much lower total runtime (~10x).
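
For reference, the same workaround expressed through `dask.config`, which is what the `DASK_DISTRIBUTED__COMM__COMPRESSION` environment variable below maps to. A minimal sketch, assuming the config is set before the cluster and workers are created:

```python
import dask

# "auto" (the default) resolves to lz4 whenever the lz4 package is installed.
print(dask.config.get("distributed.comm.compression"))

# Disable compression entirely; equivalent to exporting
# DASK_DISTRIBUTED__COMM__COMPRESSION=None before launching.
dask.config.set({"distributed.comm.compression": None})
```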

RAPIDS 22.04 (no compression)
$ DASK_DISTRIBUTED__COMM__COMPRESSION=None python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:35:28,295 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:35:28,298 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
3.96 s         | 1.51 GiB/s
4.04 s         | 1.47 GiB/s
3.95 s         | 1.51 GiB/s
================================================================================
Throughput     | 1.50 GiB +/- 16.61 MiB
Wall-Clock     | 3.98 s +/- 43.50 ms
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 655.04 MiB/s 707.73 MiB/s 717.14 MiB/s (10.62 GiB)
(02,01)        | 700.51 MiB/s 778.48 MiB/s 819.70 MiB/s (10.62 GiB)
RAPIDS 22.04 (lz4)
$ DASK_DISTRIBUTED__COMM__COMPRESSION=lz4 python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-21 04:22:57,556 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-21 04:22:57,558 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
41.16 s        | 148.28 MiB/s
44.95 s        | 135.77 MiB/s
45.06 s        | 135.44 MiB/s
================================================================================
Throughput     | 139.83 MiB +/- 5.97 MiB
Wall-Clock     | 43.73 s +/- 1.81 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 124.73 MiB/s 151.97 MiB/s 169.86 MiB/s (17.32 GiB)
(02,01)        | 115.92 MiB/s 132.03 MiB/s 144.96 MiB/s (17.32 GiB)

@quasiben @jakirkham do you have any ideas or suggestions on the best way to handle this? It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.
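
For illustration only, a sketch of how a library could override the Distributed default through public `dask.config` machinery without clobbering explicit user settings (this is not the actual Dask-CUDA change, just one possible shape for it):

```python
import dask

# Register a new default for the compression setting; values explicitly set by
# the user (config file, environment variable, dask.config.set) still win.
dask.config.update_defaults({"distributed": {"comm": {"compression": None}}})
```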

@madsbk (Member) commented Jun 21, 2022

Good catch @pentschev!

> It feels to me like Dask-CUDA/Dask-cuDF should disable compression by default or find a suitable alternative to the CPU compression algorithms that are available by default.

I agree, we should disable compression by default for now.
If we want to make compression available, we could use KvikIO's Python bindings to nvCOMP.
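
A very rough sketch of what wiring that in could look like; `gpu_compress`/`gpu_decompress` are hypothetical placeholders for the KvikIO nvCOMP bindings, and the `compressions` registry is an internal Distributed detail that may change between releases:

```python
from distributed.protocol import compression


def gpu_compress(frame):
    # Placeholder: call into KvikIO's nvCOMP bindings here.
    raise NotImplementedError


def gpu_decompress(frame):
    # Placeholder: call into KvikIO's nvCOMP bindings here.
    raise NotImplementedError


# Register the codec so that distributed.comm.compression="nvcomp" can select it.
compression.compressions["nvcomp"] = {
    "compress": gpu_compress,
    "decompress": gpu_decompress,
}
```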

@pentschev (Member, Author)

That is a good idea @madsbk. Is this something we plan to add to Distributed? It would be good to do that and do some testing/profiling.

@github-actions (bot)

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

wence- added a commit to wence-/dask-cuda that referenced this issue on Jul 21, 2022:

For GPU data, compression is worse rather than better because it provokes unnecessary device-to-host transfers.

This is a short-term fix for rapidsai#935, in lieu of hooking up GPU-based
compression algorithms.
@wence- (Contributor) commented Jul 21, 2022

A short-term fix disabling compression is in #957.

@github-actions (bot)

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
