
Implement segmented_row_bit_count for computing row sizes by segments of rows #15169

Merged
merged 22 commits into rapidsai:branch-24.04 from row_bit_count on Mar 1, 2024

Conversation

Contributor

@ttnghia ttnghia commented Feb 28, 2024

This implements cudf::segmented_bit_count, a version of cudf::row_bit_count with a new segment_length parameter added to the interface. With that parameter, segmented_bit_count computes an aggregate size for each "segment" of rows instead of a size for each individual row.

Currently, only fixed-length segments are supported.
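
For reference, a minimal calling sketch is below. It assumes the final API name cudf::segmented_row_bit_count from the PR title, declared in cudf/transform.hpp next to cudf::row_bit_count, with an interface that only adds the segment_length parameter; the wrapper function and its names are illustrative, not copied from this PR.

```cpp
// Hedged sketch: assumes cudf::segmented_row_bit_count(table_view, size_type)
// mirrors cudf::row_bit_count with one extra segment_length argument.
#include <cudf/column/column.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/transform.hpp>
#include <cudf/types.hpp>

#include <memory>

std::unique_ptr<cudf::column> segment_bit_sizes(cudf::table_view const& input,
                                                cudf::size_type segment_length)
{
  // cudf::row_bit_count(input) returns one size (in bits) per row; the
  // segmented variant instead returns one aggregate size per segment of
  // `segment_length` consecutive rows. If the row count is not a multiple of
  // segment_length, the last segment presumably covers only the remaining
  // rows (the boundary case raised in review below).
  return cudf::segmented_row_bit_count(input, segment_length);
}
```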

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia added the feature request, 3 - Ready for Review, libcudf, Spark, and non-breaking labels Feb 28, 2024
@ttnghia ttnghia self-assigned this Feb 28, 2024
@ttnghia ttnghia requested a review from a team as a code owner February 28, 2024 06:55
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia changed the title from "Support computing row sizes by segments of rows in row_bit_count" to "Implement segmented_bit_count for computing row sizes by segments of rows" Feb 28, 2024
Contributor

@nvdbaranec nvdbaranec left a comment


There are some boundary cases that need to be handled, but this seems good otherwise.

cpp/src/transform/row_bit_count.cu (outdated, resolved)
cpp/src/transform/row_bit_count.cu (outdated, resolved)
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Contributor

@nvdbaranec nvdbaranec left a comment


A couple more small things. Were there tests added to explicitly check the non-multiple-of-segment-size case?
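
For illustration, a hypothetical test of that boundary case might look like the sketch below: five INT32 rows with segment_length = 2 would give three segments of 64, 64, and 32 bits, the last segment covering only the final row. The one-entry-per-segment output layout and the final segmented_row_bit_count name are assumptions here, not quotes from the PR's test file.

```cpp
// Hypothetical boundary-case test sketch (not the test added in this PR).
#include <cudf/table/table_view.hpp>
#include <cudf/transform.hpp>
#include <cudf/types.hpp>

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/column_utilities.hpp>
#include <cudf_test/column_wrapper.hpp>

struct SegmentedRowBitCountSketch : public cudf::test::BaseFixture {};

TEST_F(SegmentedRowBitCountSketch, NonMultipleOfSegmentSize)
{
  // Five non-nullable INT32 rows => 32 bits per row.
  cudf::test::fixed_width_column_wrapper<int32_t> col{1, 2, 3, 4, 5};
  auto const input = cudf::table_view{{col}};

  auto const result = cudf::segmented_row_bit_count(input, 2);

  // Segments {rows 0-1}, {rows 2-3}, {row 4} => 64, 64, 32 bits (assumed layout).
  cudf::test::fixed_width_column_wrapper<cudf::size_type> expected{64, 64, 32};
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(expected, result->view());
}
```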

cpp/src/transform/row_bit_count.cu (outdated, resolved)
cpp/src/transform/row_bit_count.cu (outdated, resolved)
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
cpp/tests/transform/row_bit_count_test.cu (outdated, resolved)
cpp/src/transform/row_bit_count.cu (outdated, resolved)
cpp/src/transform/row_bit_count.cu (outdated, resolved)
cpp/src/transform/row_bit_count.cu (outdated, resolved)
cpp/tests/transform/row_bit_count_test.cu (outdated, resolved)
cpp/include/cudf/transform.hpp (outdated, resolved)
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@github-actions github-actions bot added the CMake label Feb 29, 2024
ttnghia and others added 2 commits February 29, 2024 12:52
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia changed the title from "Implement segmented_bit_count for computing row sizes by segments of rows" to "Implement segmented_row_bit_count for computing row sizes by segments of rows" Feb 29, 2024
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@vuule vuule self-requested a review February 29, 2024 19:59
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Contributor Author

ttnghia commented Mar 1, 2024

/merge

@rapids-bot rapids-bot bot merged commit 3b228e2 into rapidsai:branch-24.04 Mar 1, 2024
76 checks passed
@ttnghia ttnghia deleted the row_bit_count branch March 1, 2024 18:24
rapids-bot bot pushed a commit that referenced this pull request May 2, 2024
This implements the ORC chunked reader, which supports reading ORC files such that (a usage sketch is shown after the list below):
 * The output is a sequence of tables instead of a single table; each table is produced by a call to `read_chunk()` and its size stays within a given `output_limit` parameter.
 * Temporary device memory usage can be bounded by a soft `data_read_limit` parameter, allowing very large ORC files to be read without OOM.
 * ORC files containing many billions of rows can be read chunk-by-chunk without hitting the size overflow issue that arises when the number of rows exceeds the cudf size limit (`2^31` rows).
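
A hedged usage sketch of the new interface, assuming it mirrors the existing cudf::io::chunked_parquet_reader (constructed from the two byte limits plus orc_reader_options, then driven by a has_next()/read_chunk() loop); exact type and parameter names should be checked against the merged headers rather than taken from this sketch.

```cpp
// Hedged sketch only: verify cudf::io::chunked_orc_reader's exact signature
// against the merged headers; the limits and loop structure follow the
// description above.
#include <cudf/io/orc.hpp>

#include <string>
#include <vector>

std::vector<cudf::io::table_with_metadata> read_orc_in_chunks(std::string const& path)
{
  auto const options =
    cudf::io::orc_reader_options::builder(cudf::io::source_info{path}).build();

  std::size_t const output_limit    = 64UL * 1024 * 1024;   // cap on each output table, in bytes
  std::size_t const data_read_limit = 640UL * 1024 * 1024;  // soft cap on temporary device memory

  cudf::io::chunked_orc_reader reader(output_limit, data_read_limit, options);

  std::vector<cudf::io::table_with_metadata> chunks;
  while (reader.has_next()) {
    // Each call returns an independent table whose size stays within output_limit.
    chunks.emplace_back(reader.read_chunk());
  }
  return chunks;
}
```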

Depends on:
 * #14911
 * #15008
 * #15169
 * #15252

Partially contributes to #12228.

---

## Benchmarks

Thanks to some small optimizations in the ORC reader, reading ORC files all at once (reading the entire file into a single output table) can be slightly faster. For example, with the `orc_read_io_compression` benchmark:
```
## [0] Quadro RTX 6000

|      io       |  compression  |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   FILEPATH    |    SNAPPY     |       0       |      1       | 183.027 ms |       7.45% | 157.293 ms |       4.72% | -25733.837 us | -14.06% |   FAIL   |
|   FILEPATH    |    SNAPPY     |     1000      |      1       | 198.228 ms |       6.43% | 164.395 ms |       4.14% | -33833.020 us | -17.07% |   FAIL   |
|   FILEPATH    |    SNAPPY     |       0       |      32      |  96.676 ms |       6.19% |  82.522 ms |       1.36% | -14153.945 us | -14.64% |   FAIL   |
|   FILEPATH    |    SNAPPY     |     1000      |      32      |  94.508 ms |       4.80% |  81.078 ms |       0.48% | -13429.672 us | -14.21% |   FAIL   |
|   FILEPATH    |     NONE      |       0       |      1       | 161.868 ms |       5.40% | 139.849 ms |       2.44% | -22018.910 us | -13.60% |   FAIL   |
|   FILEPATH    |     NONE      |     1000      |      1       | 164.902 ms |       5.80% | 142.041 ms |       3.43% | -22861.258 us | -13.86% |   FAIL   |
|   FILEPATH    |     NONE      |       0       |      32      |  88.298 ms |       5.15% |  74.924 ms |       1.97% | -13374.607 us | -15.15% |   FAIL   |
|   FILEPATH    |     NONE      |     1000      |      32      |  87.147 ms |       5.61% |  72.502 ms |       0.50% | -14645.122 us | -16.81% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |       0       |      1       | 124.990 ms |       0.39% | 111.670 ms |       2.13% | -13320.483 us | -10.66% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |     1000      |      1       | 149.858 ms |       4.10% | 126.266 ms |       0.48% | -23591.543 us | -15.74% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |       0       |      32      |  92.499 ms |       4.46% |  77.653 ms |       1.58% | -14846.471 us | -16.05% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |     1000      |      32      |  93.373 ms |       4.14% |  80.033 ms |       3.19% | -13340.002 us | -14.29% |   FAIL   |
|  HOST_BUFFER  |     NONE      |       0       |      1       | 111.792 ms |       0.50% |  97.083 ms |       0.50% | -14709.530 us | -13.16% |   FAIL   |
|  HOST_BUFFER  |     NONE      |     1000      |      1       | 117.646 ms |       5.60% |  97.634 ms |       0.44% | -20012.301 us | -17.01% |   FAIL   |
|  HOST_BUFFER  |     NONE      |       0       |      32      |  84.983 ms |       4.96% |  66.975 ms |       0.50% | -18007.403 us | -21.19% |   FAIL   |
|  HOST_BUFFER  |     NONE      |     1000      |      32      |  82.648 ms |       4.42% |  65.510 ms |       0.91% | -17137.910 us | -20.74% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |       0       |      1       |  65.538 ms |       4.02% |  59.399 ms |       2.54% |  -6138.560 us |  -9.37% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |     1000      |      1       | 101.427 ms |       4.10% |  92.276 ms |       3.30% |  -9150.278 us |  -9.02% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |       0       |      32      |  80.133 ms |       4.64% |  73.959 ms |       3.50% |  -6173.818 us |  -7.70% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |     1000      |      32      |  86.232 ms |       4.71% |  77.446 ms |       3.32% |  -8786.606 us | -10.19% |   FAIL   |
| DEVICE_BUFFER |     NONE      |       0       |      1       |  52.189 ms |       6.62% |  45.018 ms |       4.11% |  -7171.043 us | -13.74% |   FAIL   |
| DEVICE_BUFFER |     NONE      |     1000      |      1       |  54.664 ms |       6.76% |  46.855 ms |       3.35% |  -7809.803 us | -14.29% |   FAIL   |
| DEVICE_BUFFER |     NONE      |       0       |      32      |  67.975 ms |       5.12% |  60.553 ms |       4.22% |  -7422.279 us | -10.92% |   FAIL   |
| DEVICE_BUFFER |     NONE      |     1000      |      32      |  68.485 ms |       4.86% |  62.253 ms |       6.23% |  -6232.340 us |  -9.10% |   FAIL   |

```


When memory is limited, chunked reading can help avoid OOM at the cost of some performance. For example, reading a 500 MB table from file with a 64 MB output limit and a 640 MB data read limit:
```
|      io       |  compression  |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|   FILEPATH    |    SNAPPY     |       0       |      1       | 183.027 ms |       7.45% | 350.824 ms |       2.74% | 167.796 ms |  91.68% |   FAIL   |
|   FILEPATH    |    SNAPPY     |     1000      |      1       | 198.228 ms |       6.43% | 322.414 ms |       3.46% | 124.186 ms |  62.65% |   FAIL   |
|   FILEPATH    |    SNAPPY     |       0       |      32      |  96.676 ms |       6.19% | 133.363 ms |       4.78% |  36.686 ms |  37.95% |   FAIL   |
|   FILEPATH    |    SNAPPY     |     1000      |      32      |  94.508 ms |       4.80% | 128.897 ms |       0.37% |  34.389 ms |  36.39% |   FAIL   |
|   FILEPATH    |     NONE      |       0       |      1       | 161.868 ms |       5.40% | 316.637 ms |       4.21% | 154.769 ms |  95.61% |   FAIL   |
|   FILEPATH    |     NONE      |     1000      |      1       | 164.902 ms |       5.80% | 326.043 ms |       3.06% | 161.141 ms |  97.72% |   FAIL   |
|   FILEPATH    |     NONE      |       0       |      32      |  88.298 ms |       5.15% | 124.819 ms |       5.17% |  36.520 ms |  41.36% |   FAIL   |
|   FILEPATH    |     NONE      |     1000      |      32      |  87.147 ms |       5.61% | 123.047 ms |       5.82% |  35.900 ms |  41.19% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |       0       |      1       | 124.990 ms |       0.39% | 285.718 ms |       0.78% | 160.728 ms | 128.59% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |     1000      |      1       | 149.858 ms |       4.10% | 263.491 ms |       2.89% | 113.633 ms |  75.83% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |       0       |      32      |  92.499 ms |       4.46% | 127.881 ms |       0.86% |  35.382 ms |  38.25% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |     1000      |      32      |  93.373 ms |       4.14% | 128.022 ms |       0.98% |  34.650 ms |  37.11% |   FAIL   |
|  HOST_BUFFER  |     NONE      |       0       |      1       | 111.792 ms |       0.50% | 241.064 ms |       1.89% | 129.271 ms | 115.64% |   FAIL   |
|  HOST_BUFFER  |     NONE      |     1000      |      1       | 117.646 ms |       5.60% | 248.134 ms |       3.08% | 130.488 ms | 110.92% |   FAIL   |
|  HOST_BUFFER  |     NONE      |       0       |      32      |  84.983 ms |       4.96% | 118.049 ms |       5.99% |  33.066 ms |  38.91% |   FAIL   |
|  HOST_BUFFER  |     NONE      |     1000      |      32      |  82.648 ms |       4.42% | 114.577 ms |       2.34% |  31.929 ms |  38.63% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |       0       |      1       |  65.538 ms |       4.02% | 232.466 ms |       3.28% | 166.928 ms | 254.71% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |     1000      |      1       | 101.427 ms |       4.10% | 221.578 ms |       1.43% | 120.152 ms | 118.46% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |       0       |      32      |  80.133 ms |       4.64% | 120.604 ms |       0.35% |  40.471 ms |  50.50% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |     1000      |      32      |  86.232 ms |       4.71% | 125.521 ms |       3.93% |  39.289 ms |  45.56% |   FAIL   |
| DEVICE_BUFFER |     NONE      |       0       |      1       |  52.189 ms |       6.62% | 182.943 ms |       0.29% | 130.754 ms | 250.54% |   FAIL   |
| DEVICE_BUFFER |     NONE      |     1000      |      1       |  54.664 ms |       6.76% | 190.501 ms |       0.49% | 135.836 ms | 248.49% |   FAIL   |
| DEVICE_BUFFER |     NONE      |       0       |      32      |  67.975 ms |       5.12% | 107.172 ms |       3.56% |  39.197 ms |  57.66% |   FAIL   |
| DEVICE_BUFFER |     NONE      |     1000      |      32      |  68.485 ms |       4.86% | 108.097 ms |       2.92% |  39.611 ms |  57.84% |   FAIL   |

```
And if memory is severely limited, chunked reading with an 8 MB output limit and an 80 MB data read limit:
```
|      io       |  compression  |  cardinality  |  run_length  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------------|---------------|---------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|   FILEPATH    |    SNAPPY     |       0       |      1       | 183.027 ms |       7.45% | 732.926 ms |       1.98% | 549.899 ms | 300.45% |   FAIL   |
|   FILEPATH    |    SNAPPY     |     1000      |      1       | 198.228 ms |       6.43% | 834.309 ms |       4.21% | 636.081 ms | 320.88% |   FAIL   |
|   FILEPATH    |    SNAPPY     |       0       |      32      |  96.676 ms |       6.19% | 363.033 ms |       1.66% | 266.356 ms | 275.51% |   FAIL   |
|   FILEPATH    |    SNAPPY     |     1000      |      32      |  94.508 ms |       4.80% | 313.813 ms |       1.28% | 219.305 ms | 232.05% |   FAIL   |
|   FILEPATH    |     NONE      |       0       |      1       | 161.868 ms |       5.40% | 607.700 ms |       2.90% | 445.832 ms | 275.43% |   FAIL   |
|   FILEPATH    |     NONE      |     1000      |      1       | 164.902 ms |       5.80% | 616.101 ms |       3.46% | 451.199 ms | 273.62% |   FAIL   |
|   FILEPATH    |     NONE      |       0       |      32      |  88.298 ms |       5.15% | 267.703 ms |       0.46% | 179.405 ms | 203.18% |   FAIL   |
|   FILEPATH    |     NONE      |     1000      |      32      |  87.147 ms |       5.61% | 250.528 ms |       0.43% | 163.381 ms | 187.48% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |       0       |      1       | 124.990 ms |       0.39% | 636.270 ms |       0.44% | 511.280 ms | 409.06% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |     1000      |      1       | 149.858 ms |       4.10% | 747.264 ms |       0.50% | 597.406 ms | 398.65% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |       0       |      32      |  92.499 ms |       4.46% | 359.660 ms |       0.19% | 267.161 ms | 288.82% |   FAIL   |
|  HOST_BUFFER  |    SNAPPY     |     1000      |      32      |  93.373 ms |       4.14% | 311.608 ms |       0.43% | 218.235 ms | 233.73% |   FAIL   |
|  HOST_BUFFER  |     NONE      |       0       |      1       | 111.792 ms |       0.50% | 493.797 ms |       0.13% | 382.005 ms | 341.71% |   FAIL   |
|  HOST_BUFFER  |     NONE      |     1000      |      1       | 117.646 ms |       5.60% | 516.706 ms |       0.12% | 399.060 ms | 339.20% |   FAIL   |
|  HOST_BUFFER  |     NONE      |       0       |      32      |  84.983 ms |       4.96% | 258.477 ms |       0.46% | 173.495 ms | 204.15% |   FAIL   |
|  HOST_BUFFER  |     NONE      |     1000      |      32      |  82.648 ms |       4.42% | 248.028 ms |       5.30% | 165.380 ms | 200.10% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |       0       |      1       |  65.538 ms |       4.02% | 606.010 ms |       3.76% | 540.472 ms | 824.68% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |     1000      |      1       | 101.427 ms |       4.10% | 742.774 ms |       4.64% | 641.347 ms | 632.33% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |       0       |      32      |  80.133 ms |       4.64% | 364.701 ms |       2.70% | 284.568 ms | 355.12% |   FAIL   |
| DEVICE_BUFFER |    SNAPPY     |     1000      |      32      |  86.232 ms |       4.71% | 320.387 ms |       2.80% | 234.155 ms | 271.54% |   FAIL   |
| DEVICE_BUFFER |     NONE      |       0       |      1       |  52.189 ms |       6.62% | 458.100 ms |       2.15% | 405.912 ms | 777.78% |   FAIL   |
| DEVICE_BUFFER |     NONE      |     1000      |      1       |  54.664 ms |       6.76% | 478.527 ms |       1.41% | 423.862 ms | 775.39% |   FAIL   |
| DEVICE_BUFFER |     NONE      |       0       |      32      |  67.975 ms |       5.12% | 260.009 ms |       3.71% | 192.034 ms | 282.51% |   FAIL   |
| DEVICE_BUFFER |     NONE      |     1000      |      32      |  68.485 ms |       4.86% | 243.705 ms |       2.09% | 175.220 ms | 255.85% |   FAIL   |

```

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/nvdbaranec
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15094