Implement `segmented_row_bit_count` for computing row sizes by segments of rows (#15169)
Conversation
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Changed the title from `row_bit_count` to `segmented_bit_count` for computing row sizes by segments of rows
Some boundary cases need to be handled, but this looks good otherwise.
A couple more small things. Were there tests added to explicitly check the non-multiple-of-segment-size case?
Force-pushed from 859454a to 2556829
Changed the title from `segmented_bit_count` for computing row sizes by segments of rows to `segmented_row_bit_count` for computing row sizes by segments of rows
/merge
This implements the ORC chunked reader, to support reading ORC such that:

* The output is multiple tables instead of one; each is issued by a call to `read_chunk()` and has a limited size that stays within a given `output_limit` parameter.
* The temporary device memory usage can be bounded by a soft `data_read_limit` parameter, allowing very large ORC files to be read without OOM.
* ORC files containing many billions of rows can be properly read chunk-by-chunk without hitting the size overflow issue that occurs when the number of rows exceeds the cudf size limit (`2^31` rows).

Depends on:
* #14911
* #15008
* #15169
* #15252

Partially contributes to #12228.

---

## Benchmarks

Thanks to some small optimizations in the ORC reader, reading ORC files all at once (the entire file into a single output table) can be slightly faster. For example, with the benchmark `orc_read_io_compression`:

```
## [0] Quadro RTX 6000

| io            | compression   | cardinality   | run_length   | Ref Time   | Ref Noise   | Cmp Time   | Cmp Noise   | Diff          | %Diff   | Status   |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
| FILEPATH      | SNAPPY        | 0             | 1            | 183.027 ms | 7.45%       | 157.293 ms | 4.72%       | -25733.837 us | -14.06% | FAIL     |
| FILEPATH      | SNAPPY        | 1000          | 1            | 198.228 ms | 6.43%       | 164.395 ms | 4.14%       | -33833.020 us | -17.07% | FAIL     |
| FILEPATH      | SNAPPY        | 0             | 32           | 96.676 ms  | 6.19%       | 82.522 ms  | 1.36%       | -14153.945 us | -14.64% | FAIL     |
| FILEPATH      | SNAPPY        | 1000          | 32           | 94.508 ms  | 4.80%       | 81.078 ms  | 0.48%       | -13429.672 us | -14.21% | FAIL     |
| FILEPATH      | NONE          | 0             | 1            | 161.868 ms | 5.40%       | 139.849 ms | 2.44%       | -22018.910 us | -13.60% | FAIL     |
| FILEPATH      | NONE          | 1000          | 1            | 164.902 ms | 5.80%       | 142.041 ms | 3.43%       | -22861.258 us | -13.86% | FAIL     |
| FILEPATH      | NONE          | 0             | 32           | 88.298 ms  | 5.15%       | 74.924 ms  | 1.97%       | -13374.607 us | -15.15% | FAIL     |
| FILEPATH      | NONE          | 1000          | 32           | 87.147 ms  | 5.61%       | 72.502 ms  | 0.50%       | -14645.122 us | -16.81% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 0             | 1            | 124.990 ms | 0.39%       | 111.670 ms | 2.13%       | -13320.483 us | -10.66% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 1000          | 1            | 149.858 ms | 4.10%       | 126.266 ms | 0.48%       | -23591.543 us | -15.74% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 0             | 32           | 92.499 ms  | 4.46%       | 77.653 ms  | 1.58%       | -14846.471 us | -16.05% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 1000          | 32           | 93.373 ms  | 4.14%       | 80.033 ms  | 3.19%       | -13340.002 us | -14.29% | FAIL     |
| HOST_BUFFER   | NONE          | 0             | 1            | 111.792 ms | 0.50%       | 97.083 ms  | 0.50%       | -14709.530 us | -13.16% | FAIL     |
| HOST_BUFFER   | NONE          | 1000          | 1            | 117.646 ms | 5.60%       | 97.634 ms  | 0.44%       | -20012.301 us | -17.01% | FAIL     |
| HOST_BUFFER   | NONE          | 0             | 32           | 84.983 ms  | 4.96%       | 66.975 ms  | 0.50%       | -18007.403 us | -21.19% | FAIL     |
| HOST_BUFFER   | NONE          | 1000          | 32           | 82.648 ms  | 4.42%       | 65.510 ms  | 0.91%       | -17137.910 us | -20.74% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 0             | 1            | 65.538 ms  | 4.02%       | 59.399 ms  | 2.54%       | -6138.560 us  | -9.37%  | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 1000          | 1            | 101.427 ms | 4.10%       | 92.276 ms  | 3.30%       | -9150.278 us  | -9.02%  | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 0             | 32           | 80.133 ms  | 4.64%       | 73.959 ms  | 3.50%       | -6173.818 us  | -7.70%  | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 1000          | 32           | 86.232 ms  | 4.71%       | 77.446 ms  | 3.32%       | -8786.606 us  | -10.19% | FAIL     |
| DEVICE_BUFFER | NONE          | 0             | 1            | 52.189 ms  | 6.62%       | 45.018 ms  | 4.11%       | -7171.043 us  | -13.74% | FAIL     |
| DEVICE_BUFFER | NONE          | 1000          | 1            | 54.664 ms  | 6.76%       | 46.855 ms  | 3.35%       | -7809.803 us  | -14.29% | FAIL     |
| DEVICE_BUFFER | NONE          | 0             | 32           | 67.975 ms  | 5.12%       | 60.553 ms  | 4.22%       | -7422.279 us  | -10.92% | FAIL     |
| DEVICE_BUFFER | NONE          | 1000          | 32           | 68.485 ms  | 4.86%       | 62.253 ms  | 6.23%       | -6232.340 us  | -9.10%  | FAIL     |
```

When memory is limited, chunked reads help avoid OOM, with some performance trade-off. For example, reading a 500MB table from file with a 64MB output limit and a 640MB data read limit:

```
| io            | compression   | cardinality   | run_length   | Ref Time   | Ref Noise   | Cmp Time   | Cmp Noise   | Diff       | %Diff   | Status   |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| FILEPATH      | SNAPPY        | 0             | 1            | 183.027 ms | 7.45%       | 350.824 ms | 2.74%       | 167.796 ms | 91.68%  | FAIL     |
| FILEPATH      | SNAPPY        | 1000          | 1            | 198.228 ms | 6.43%       | 322.414 ms | 3.46%       | 124.186 ms | 62.65%  | FAIL     |
| FILEPATH      | SNAPPY        | 0             | 32           | 96.676 ms  | 6.19%       | 133.363 ms | 4.78%       | 36.686 ms  | 37.95%  | FAIL     |
| FILEPATH      | SNAPPY        | 1000          | 32           | 94.508 ms  | 4.80%       | 128.897 ms | 0.37%       | 34.389 ms  | 36.39%  | FAIL     |
| FILEPATH      | NONE          | 0             | 1            | 161.868 ms | 5.40%       | 316.637 ms | 4.21%       | 154.769 ms | 95.61%  | FAIL     |
| FILEPATH      | NONE          | 1000          | 1            | 164.902 ms | 5.80%       | 326.043 ms | 3.06%       | 161.141 ms | 97.72%  | FAIL     |
| FILEPATH      | NONE          | 0             | 32           | 88.298 ms  | 5.15%       | 124.819 ms | 5.17%       | 36.520 ms  | 41.36%  | FAIL     |
| FILEPATH      | NONE          | 1000          | 32           | 87.147 ms  | 5.61%       | 123.047 ms | 5.82%       | 35.900 ms  | 41.19%  | FAIL     |
| HOST_BUFFER   | SNAPPY        | 0             | 1            | 124.990 ms | 0.39%       | 285.718 ms | 0.78%       | 160.728 ms | 128.59% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 1000          | 1            | 149.858 ms | 4.10%       | 263.491 ms | 2.89%       | 113.633 ms | 75.83%  | FAIL     |
| HOST_BUFFER   | SNAPPY        | 0             | 32           | 92.499 ms  | 4.46%       | 127.881 ms | 0.86%       | 35.382 ms  | 38.25%  | FAIL     |
| HOST_BUFFER   | SNAPPY        | 1000          | 32           | 93.373 ms  | 4.14%       | 128.022 ms | 0.98%       | 34.650 ms  | 37.11%  | FAIL     |
| HOST_BUFFER   | NONE          | 0             | 1            | 111.792 ms | 0.50%       | 241.064 ms | 1.89%       | 129.271 ms | 115.64% | FAIL     |
| HOST_BUFFER   | NONE          | 1000          | 1            | 117.646 ms | 5.60%       | 248.134 ms | 3.08%       | 130.488 ms | 110.92% | FAIL     |
| HOST_BUFFER   | NONE          | 0             | 32           | 84.983 ms  | 4.96%       | 118.049 ms | 5.99%       | 33.066 ms  | 38.91%  | FAIL     |
| HOST_BUFFER   | NONE          | 1000          | 32           | 82.648 ms  | 4.42%       | 114.577 ms | 2.34%       | 31.929 ms  | 38.63%  | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 0             | 1            | 65.538 ms  | 4.02%       | 232.466 ms | 3.28%       | 166.928 ms | 254.71% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 1000          | 1            | 101.427 ms | 4.10%       | 221.578 ms | 1.43%       | 120.152 ms | 118.46% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 0             | 32           | 80.133 ms  | 4.64%       | 120.604 ms | 0.35%       | 40.471 ms  | 50.50%  | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 1000          | 32           | 86.232 ms  | 4.71%       | 125.521 ms | 3.93%       | 39.289 ms  | 45.56%  | FAIL     |
| DEVICE_BUFFER | NONE          | 0             | 1            | 52.189 ms  | 6.62%       | 182.943 ms | 0.29%       | 130.754 ms | 250.54% | FAIL     |
| DEVICE_BUFFER | NONE          | 1000          | 1            | 54.664 ms  | 6.76%       | 190.501 ms | 0.49%       | 135.836 ms | 248.49% | FAIL     |
| DEVICE_BUFFER | NONE          | 0             | 32           | 67.975 ms  | 5.12%       | 107.172 ms | 3.56%       | 39.197 ms  | 57.66%  | FAIL     |
| DEVICE_BUFFER | NONE          | 1000          | 32           | 68.485 ms  | 4.86%       | 108.097 ms | 2.92%       | 39.611 ms  | 57.84%  | FAIL     |
```

And if memory is very limited, chunked read with an 8MB output limit and an 80MB data read limit:

```
| io            | compression   | cardinality   | run_length   | Ref Time   | Ref Noise   | Cmp Time   | Cmp Noise   | Diff       | %Diff   | Status   |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
| FILEPATH      | SNAPPY        | 0             | 1            | 183.027 ms | 7.45%       | 732.926 ms | 1.98%       | 549.899 ms | 300.45% | FAIL     |
| FILEPATH      | SNAPPY        | 1000          | 1            | 198.228 ms | 6.43%       | 834.309 ms | 4.21%       | 636.081 ms | 320.88% | FAIL     |
| FILEPATH      | SNAPPY        | 0             | 32           | 96.676 ms  | 6.19%       | 363.033 ms | 1.66%       | 266.356 ms | 275.51% | FAIL     |
| FILEPATH      | SNAPPY        | 1000          | 32           | 94.508 ms  | 4.80%       | 313.813 ms | 1.28%       | 219.305 ms | 232.05% | FAIL     |
| FILEPATH      | NONE          | 0             | 1            | 161.868 ms | 5.40%       | 607.700 ms | 2.90%       | 445.832 ms | 275.43% | FAIL     |
| FILEPATH      | NONE          | 1000          | 1            | 164.902 ms | 5.80%       | 616.101 ms | 3.46%       | 451.199 ms | 273.62% | FAIL     |
| FILEPATH      | NONE          | 0             | 32           | 88.298 ms  | 5.15%       | 267.703 ms | 0.46%       | 179.405 ms | 203.18% | FAIL     |
| FILEPATH      | NONE          | 1000          | 32           | 87.147 ms  | 5.61%       | 250.528 ms | 0.43%       | 163.381 ms | 187.48% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 0             | 1            | 124.990 ms | 0.39%       | 636.270 ms | 0.44%       | 511.280 ms | 409.06% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 1000          | 1            | 149.858 ms | 4.10%       | 747.264 ms | 0.50%       | 597.406 ms | 398.65% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 0             | 32           | 92.499 ms  | 4.46%       | 359.660 ms | 0.19%       | 267.161 ms | 288.82% | FAIL     |
| HOST_BUFFER   | SNAPPY        | 1000          | 32           | 93.373 ms  | 4.14%       | 311.608 ms | 0.43%       | 218.235 ms | 233.73% | FAIL     |
| HOST_BUFFER   | NONE          | 0             | 1            | 111.792 ms | 0.50%       | 493.797 ms | 0.13%       | 382.005 ms | 341.71% | FAIL     |
| HOST_BUFFER   | NONE          | 1000          | 1            | 117.646 ms | 5.60%       | 516.706 ms | 0.12%       | 399.060 ms | 339.20% | FAIL     |
| HOST_BUFFER   | NONE          | 0             | 32           | 84.983 ms  | 4.96%       | 258.477 ms | 0.46%       | 173.495 ms | 204.15% | FAIL     |
| HOST_BUFFER   | NONE          | 1000          | 32           | 82.648 ms  | 4.42%       | 248.028 ms | 5.30%       | 165.380 ms | 200.10% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 0             | 1            | 65.538 ms  | 4.02%       | 606.010 ms | 3.76%       | 540.472 ms | 824.68% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 1000          | 1            | 101.427 ms | 4.10%       | 742.774 ms | 4.64%       | 641.347 ms | 632.33% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 0             | 32           | 80.133 ms  | 4.64%       | 364.701 ms | 2.70%       | 284.568 ms | 355.12% | FAIL     |
| DEVICE_BUFFER | SNAPPY        | 1000          | 32           | 86.232 ms  | 4.71%       | 320.387 ms | 2.80%       | 234.155 ms | 271.54% | FAIL     |
| DEVICE_BUFFER | NONE          | 0             | 1            | 52.189 ms  | 6.62%       | 458.100 ms | 2.15%       | 405.912 ms | 777.78% | FAIL     |
| DEVICE_BUFFER | NONE          | 1000          | 1            | 54.664 ms  | 6.76%       | 478.527 ms | 1.41%       | 423.862 ms | 775.39% | FAIL     |
| DEVICE_BUFFER | NONE          | 0             | 32           | 67.975 ms  | 5.12%       | 260.009 ms | 3.71%       | 192.034 ms | 282.51% | FAIL     |
| DEVICE_BUFFER | NONE          | 1000          | 32           | 68.485 ms  | 4.86%       | 243.705 ms | 2.09%       | 175.220 ms | 255.85% | FAIL     |
```

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Bradley Dice (https://github.com/bdice)
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)

URL: #15094
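To make the chunking behavior above concrete, here is a minimal plain-Python model of how an output limit bounds each emitted chunk. This is not the libcudf API; the function name `read_in_chunks` and the byte-size inputs are hypothetical, and it only illustrates the "each chunk stays within `output_limit`" idea, not ORC decoding.

```python
def read_in_chunks(row_sizes, output_limit):
    """Group row indices into chunks whose total size stays within output_limit.

    A single row larger than the limit still forms its own chunk, so the
    reader always makes forward progress (mirroring a soft limit).
    """
    chunks = []
    current, current_size = [], 0
    for i, size in enumerate(row_sizes):
        # Flush the current chunk when adding this row would exceed the limit.
        if current and current_size + size > output_limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Rows of 40, 40, 40, 100, 10 bytes with a 100-byte output limit:
print(read_in_chunks([40, 40, 40, 100, 10], 100))  # → [[0, 1], [2], [3], [4]]
```

In the real reader, each call to `read_chunk()` would return one such bounded table rather than the whole file at once.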
This implements `cudf::segmented_row_bit_count`, a version of `cudf::row_bit_count` that adds a `segment_length` parameter to the interface. With the new parameter, `segmented_row_bit_count` computes an aggregate size for each "segment" of rows instead of a size for each individual row. Currently, only fixed-length segments are supported.
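The segment semantics can be sketched with a small plain-Python model (not the libcudf implementation): given per-row bit sizes like those `row_bit_count` produces, each output element is the sum over one fixed-length segment, and the last segment may be shorter when the row count is not a multiple of `segment_length` (the boundary case raised in review).

```python
def segmented_row_bit_count(row_sizes, segment_length):
    """Sum per-row bit sizes over fixed-length segments of rows.

    The final segment may hold fewer than segment_length rows when the
    row count is not a multiple of the segment length.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        sum(row_sizes[start:start + segment_length])
        for start in range(0, len(row_sizes), segment_length)
    ]

# 5 rows with segments of length 2: the trailing segment holds a single row.
print(segmented_row_bit_count([8, 8, 16, 8, 32], 2))  # → [16, 24, 32]
```

With `segment_length == 1` this degenerates to the per-row behavior of `row_bit_count`.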