[FEA] Add an ORC reader benchmark that uses multiple CUDA streams #15973
Labels
0 - Backlog
In queue waiting for assignment
cuIO
cuIO issue
feature request
New feature or request
good first issue
Good for newcomers
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
github-actions bot added the External label on Jun 11, 2024.
Matt711 added the 0 - Backlog, Needs Triage, libcudf, cuIO, and Performance labels on Jun 11, 2024.
GregoryKimball added the good first issue label and removed the Needs Triage and External labels on Jun 11, 2024.
I'm planning to work on this btw.
rapids-bot bot pushed a commit that referenced this issue on Jun 14, 2024:
Addresses: #15973

Adds multithreaded benchmarks for the ORC reader. Based off of the parquet equivalent in #15585

```
# Benchmark Results

## orc_multithreaded_read_decode_mixed

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 338x | 44.348 ms | 1.18% | 44.343 ms | 1.18% | 12107185968 | 939.341 MiB | 39.557 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 80x | 77.634 ms | 0.65% | 77.629 ms | 0.65% | 13831742649 | 1.834 GiB | 79.072 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 341x | 43.921 ms | 1.20% | 43.916 ms | 1.20% | 12224889363 | 825.333 MiB | 39.568 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 80x | 75.418 ms | 0.70% | 75.414 ms | 0.70% | 14237999015 | 1.611 GiB | 79.113 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 80x | 42.682 ms | 1.18% | 42.678 ms | 1.18% | 12579566132 | 883.436 MiB | 39.587 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 9x | 74.056 ms | 0.48% | 74.052 ms | 0.48% | 14499873867 | 1.724 GiB | 79.136 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 25x | 42.198 ms | 0.50% | 42.194 ms | 0.49% | 12723960975 | 940.562 MiB | 39.600 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 8x | 73.933 ms | 0.49% | 73.929 ms | 0.49% | 14524042443 | 1.781 GiB | 79.175 MiB |

## orc_multithreaded_read_decode_fixed_width

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 13x | 40.149 ms | 0.04% | 40.144 ms | 0.04% | 13373482726 | 643.390 MiB | 59.821 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 211x | 71.216 ms | 0.67% | 71.211 ms | 0.67% | 15078297784 | 1.257 GiB | 119.650 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 378x | 39.662 ms | 1.31% | 39.658 ms | 1.31% | 13537590893 | 643.392 MiB | 59.833 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 209x | 71.693 ms | 0.71% | 71.688 ms | 0.71% | 14978085376 | 1.257 GiB | 119.642 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 377x | 39.731 ms | 1.30% | 39.726 ms | 1.30% | 13514305239 | 643.394 MiB | 59.856 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 8x | 70.766 ms | 0.08% | 70.761 ms | 0.08% | 15174115364 | 1.030 GiB | 119.665 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 379x | 39.486 ms | 1.27% | 39.482 ms | 1.27% | 13597888468 | 647.399 MiB | 59.928 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 207x | 72.686 ms | 2.04% | 72.681 ms | 2.04% | 14773317833 | 1.143 GiB | 119.711 MiB |

## orc_multithreaded_read_decode_string

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 80x | 22.933 ms | 2.13% | 22.928 ms | 2.13% | 23415352877 | 661.948 MiB | 10.879 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 160x | 34.167 ms | 1.41% | 34.162 ms | 1.41% | 31430436877 | 1.293 GiB | 21.757 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 560x | 22.533 ms | 2.18% | 22.528 ms | 2.18% | 23830839172 | 609.407 MiB | 10.941 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 80x | 34.311 ms | 1.54% | 34.307 ms | 1.54% | 31298288990 | 1.188 GiB | 21.758 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 23x | 22.179 ms | 0.11% | 22.175 ms | 0.11% | 24211151047 | 624.177 MiB | 10.947 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 15x | 33.793 ms | 0.08% | 33.789 ms | 0.08% | 31777989791 | 1.190 GiB | 21.881 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 679x | 22.006 ms | 1.74% | 22.002 ms | 1.74% | 24401381631 | 624.524 MiB | 10.951 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 160x | 33.320 ms | 1.57% | 33.316 ms | 1.57% | 32229227026 | 1.207 GiB | 21.894 MiB |

## orc_multithreaded_read_decode_list

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|---------|------------|--------|------------|--------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 96x | 74.437 ms | 0.68% | 74.433 ms | 0.68% | 7212831148 | 600.751 MiB | 60.245 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 7x | 80.994 ms | 0.49% | 80.990 ms | 0.49% | 13257745936 | 1.173 GiB | 120.549 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 80x | 79.234 ms | 4.57% | 79.229 ms | 4.57% | 6776190522 | 600.950 MiB | 60.250 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 166x | 90.437 ms | 17.19% | 90.432 ms | 17.19% | 11873413959 | 1.173 GiB | 120.489 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 80x | 78.613 ms | 2.98% | 78.608 ms | 2.98% | 6829702014 | 602.764 MiB | 60.323 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 127x | 118.629 ms | 22.67% | 118.624 ms | 22.67% | 9051644873 | 1.174 GiB | 120.499 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 112x | 133.950 ms | 4.45% | 133.945 ms | 4.45% | 4008135293 | 603.471 MiB | 60.353 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 90x | 167.850 ms | 15.93% | 167.844 ms | 15.93% | 6397248426 | 1.177 GiB | 120.646 MiB |

## orc_multithreaded_read_decode_chunked_mixed

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 333x | 45.009 ms | 1.10% | 45.005 ms | 1.10% | 11929261073 | 939.341 MiB | 39.557 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 96x | 81.524 ms | 0.61% | 81.519 ms | 0.61% | 13171640865 | 1.834 GiB | 79.072 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 339x | 44.183 ms | 0.96% | 44.179 ms | 0.96% | 12152252271 | 825.333 MiB | 39.568 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 7x | 79.051 ms | 0.02% | 79.046 ms | 0.02% | 13583676002 | 1.611 GiB | 79.113 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 12x | 43.276 ms | 0.09% | 43.272 ms | 0.09% | 12407024794 | 883.436 MiB | 39.587 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 19x | 78.019 ms | 0.49% | 78.014 ms | 0.49% | 13763433041 | 1.724 GiB | 79.136 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 42.803 ms | 1.22% | 42.799 ms | 1.22% | 12543864010 | 911.993 MiB | 39.600 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 193x | 77.856 ms | 0.59% | 77.852 ms | 0.59% | 13792063986 | 1.837 GiB | 79.175 MiB |

## orc_multithreaded_read_decode_chunked_fixed_width

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 112x | 40.497 ms | 1.23% | 40.493 ms | 1.23% | 13258480947 | 643.390 MiB | 59.821 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 7x | 75.440 ms | 0.09% | 75.435 ms | 0.09% | 14234033611 | 1.648 GiB | 119.651 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 80x | 39.793 ms | 1.36% | 39.789 ms | 1.36% | 13493067216 | 643.392 MiB | 59.833 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 69x | 74.499 ms | 0.50% | 74.494 ms | 0.50% | 14413864845 | 1.336 GiB | 119.642 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 381x | 39.273 ms | 1.11% | 39.269 ms | 1.11% | 13671742653 | 643.394 MiB | 59.856 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 204x | 73.755 ms | 0.60% | 73.751 ms | 0.60% | 14559012350 | 1.648 GiB | 119.665 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 39.490 ms | 1.31% | 39.486 ms | 1.31% | 13596333864 | 631.980 MiB | 59.928 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 203x | 73.907 ms | 1.34% | 73.903 ms | 1.34% | 14529071322 | 1.454 GiB | 119.711 MiB |

## orc_multithreaded_read_decode_chunked_string

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|-----------|-------|-----------|-------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 80x | 23.022 ms | 1.96% | 23.017 ms | 1.96% | 23324556592 | 661.948 MiB | 10.879 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 80x | 37.687 ms | 1.37% | 37.682 ms | 1.37% | 28494755419 | 1.659 GiB | 21.757 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 80x | 22.703 ms | 2.30% | 22.699 ms | 2.30% | 23652118769 | 609.407 MiB | 10.941 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 80x | 37.581 ms | 1.42% | 37.577 ms | 1.42% | 28574723179 | 1.658 GiB | 21.758 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 544x | 22.296 ms | 1.56% | 22.293 ms | 1.56% | 24082840350 | 631.319 MiB | 10.947 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 14x | 36.990 ms | 0.14% | 36.985 ms | 0.14% | 29031484389 | 1.554 GiB | 21.881 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 676x | 22.114 ms | 1.22% | 22.110 ms | 1.22% | 24281965280 | 627.616 MiB | 10.951 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 37.409 ms | 1.40% | 37.405 ms | 1.40% | 28706077426 | 1.562 GiB | 21.894 MiB |

## orc_multithreaded_read_decode_chunked_list

### [0] NVIDIA RTX 5880 Ada Generation

| cardinality | total_data_size | num_threads | num_cols | run_length | input_limit | output_limit | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-----------------|-------------|----------|------------|-------------|--------------|---------|------------|--------|------------|--------|------------------|-------------------|-------------------|
| 1000 | 536870912 | 1 | 4 | 8 | 671088640 | 671088640 | 80x | 74.780 ms | 0.67% | 74.776 ms | 0.67% | 7179747067 | 600.751 MiB | 60.245 MiB |
| 1000 | 1073741824 | 1 | 4 | 8 | 671088640 | 671088640 | 175x | 86.040 ms | 0.56% | 86.035 ms | 0.56% | 12480222210 | 1.576 GiB | 120.549 MiB |
| 1000 | 536870912 | 2 | 4 | 8 | 671088640 | 671088640 | 186x | 80.668 ms | 4.14% | 80.664 ms | 4.14% | 6655685080 | 600.951 MiB | 60.250 MiB |
| 1000 | 1073741824 | 2 | 4 | 8 | 671088640 | 671088640 | 143x | 105.217 ms | 21.56% | 105.212 ms | 21.56% | 10205531345 | 1.576 GiB | 120.489 MiB |
| 1000 | 536870912 | 4 | 4 | 8 | 671088640 | 671088640 | 128x | 80.087 ms | 3.05% | 80.082 ms | 3.05% | 6704042147 | 602.764 MiB | 60.323 MiB |
| 1000 | 1073741824 | 4 | 4 | 8 | 671088640 | 671088640 | 135x | 111.556 ms | 21.88% | 111.551 ms | 21.88% | 9625546746 | 1.489 GiB | 120.499 MiB |
| 1000 | 536870912 | 8 | 4 | 8 | 671088640 | 671088640 | 112x | 134.677 ms | 4.14% | 134.672 ms | 4.14% | 3986513604 | 603.471 MiB | 60.353 MiB |
| 1000 | 1073741824 | 8 | 4 | 8 | 671088640 | 671088640 | 80x | 178.735 ms | 14.17% | 178.730 ms | 14.17% | 6007630497 | 1.520 GiB | 120.646 MiB |
```

Authors:
- Zach Puller (https://github.com/zpuller)
- Vukasin Milovanovic (https://github.com/vuule)
- MithunR (https://github.com/mythrocks)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- MithunR (https://github.com/mythrocks)

URL: #16009
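For anyone reading the numbers in the commit message: the `bytes_per_second` column looks consistent with `total_data_size` divided by the reported GPU time. That relationship is an inference from the figures, not something the commit message states. A small sketch of the arithmetic, using a hypothetical helper named `bytes_per_second`:

```cpp
#include <cstdint>

// Assumption (inferred from the reported figures, not from the benchmark source):
// throughput = total_data_size in bytes / GPU time in seconds.
// `bytes_per_second` is a hypothetical helper used only for this illustration.
double bytes_per_second(std::uint64_t total_data_size, double gpu_time_ms)
{
  double const gpu_time_s = gpu_time_ms / 1000.0;  // convert ms to s
  return static_cast<double>(total_data_size) / gpu_time_s;
}
```

Plugging in the first `orc_multithreaded_read_decode_mixed` row (536870912 bytes, 44.343 ms) gives roughly 12.1e9 bytes/s, which is close to the reported 12107185968. The small difference would be expected if the table's time is rounded for display.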
Is your feature request related to a problem? Please describe.
This extends #12700 by adding a benchmark for multi-stream ORC reads. As with Parquet, this access pattern is common in Spark-RAPIDS.
Describe the solution you'd like
As in #12700, a libcudf microbenchmark that creates several host threads, each with its own non-default CUDA stream, and then reads a large ORC dataset from host memory into a libcudf table using the read_orc detail API.
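A rough sketch of the thread-per-stream shape described above, written against the C++ standard library only so it runs without a GPU. `FakeStream` and `read_on_stream` are hypothetical stand-ins: a real benchmark would create a non-default CUDA stream per thread and call the libcudf ORC reader with it, and none of the names below come from libcudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-in for a non-default CUDA stream handle (not a libcudf/RMM type).
struct FakeStream {
  int id;
};

// Hypothetical stand-in for one ORC read on a given stream.
// Here it only reports how many bytes it was handed; a real benchmark would
// decode the host buffer into a table on `stream`.
std::size_t read_on_stream(std::vector<std::uint8_t> const& host_buffer, FakeStream stream)
{
  (void)stream;  // unused in this CPU-only sketch
  return host_buffer.size();
}

// Launch one host thread per input buffer, give each its own stream,
// and return the total number of bytes read across all threads.
std::size_t multithreaded_read(std::vector<std::vector<std::uint8_t>> const& buffers)
{
  std::vector<std::size_t> bytes_read(buffers.size(), 0);
  std::vector<std::thread> threads;
  threads.reserve(buffers.size());

  for (std::size_t i = 0; i < buffers.size(); ++i) {
    threads.emplace_back([&buffers, &bytes_read, i] {
      // Ids start at 1 to mimic "non-default" streams (0 stands for the default stream).
      FakeStream stream{static_cast<int>(i) + 1};
      // Each thread writes only its own slot, so no locking is needed.
      bytes_read[i] = read_on_stream(buffers[i], stream);
    });
  }
  for (auto& t : threads) { t.join(); }

  return std::accumulate(bytes_read.begin(), bytes_read.end(), std::size_t{0});
}
```

Timing the whole `multithreaded_read` call and dividing the returned byte count by the elapsed time would give an aggregate throughput figure for a given thread count.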
Describe alternatives you've considered
The alternative would be to continue using Spark-RAPIDS NDS runs to track the performance of libcudf's ORC reader in a multi-threaded, multi-stream use case.