Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40994: [C++][Parquet] RleBooleanDecoder supports DecodeArrow with nulls #40995

Merged
merged 9 commits into from
Apr 4, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Apr 4, 2024

Rationale for this change

Supports DecodeArrow with nulls in RleBooleanDecoder

What changes are included in this PR?

Supports DecodeArrow with nulls in RleBooleanDecoder

Are these changes tested?

Yes

Are there any user-facing changes?

currently not

@mapleFU mapleFU requested a review from wgtmac as a code owner April 4, 2024 05:15
Copy link

github-actions bot commented Apr 4, 2024

⚠️ GitHub issue #40994 has been automatically assigned in GitHub to PR creator.

@mapleFU mapleFU force-pushed the rle-boolean-impl-decode-arrow branch from 34d3c72 to 8a33bea Compare April 4, 2024 05:17
@mapleFU
Copy link
Member Author

mapleFU commented Apr 4, 2024

@pitrou @wgtmac

cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Apr 4, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Apr 4, 2024

benchmark number wouldn't be great, because RleDecoder need to decode to bool first, and unpack8 goes to unpack32.

@pitrou
Copy link
Member

pitrou commented Apr 4, 2024

Well, it seems the benchmarks fail:

BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:100            3572 ns         3572 ns       191956 bytes_per_second=34.1742Mi/s items_per_second=286.674M/s
terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Unexpected end of stream
Abandon

@mapleFU
Copy link
Member Author

mapleFU commented Apr 4, 2024

Testing on My PC ( Amd 3800X, WSL2, gcc11.4, Release(-O2) )

RLE

BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:1            1490 ns         1490 ns       507916 bytes_per_second=81.913Mi/s items_per_second=687.136M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:1            3761 ns         3760 ns       187492 bytes_per_second=129.862Mi/s items_per_second=1.08936G/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1          94859 ns        94860 ns         7488 bytes_per_second=20.5896Mi/s items_per_second=172.718M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:1         439572 ns       439574 ns         1660 bytes_per_second=17.7729Mi/s items_per_second=149.09M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:100          4670 ns         4672 ns       146128 bytes_per_second=26.1272Mi/s items_per_second=219.171M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:100         15912 ns        15912 ns        44211 bytes_per_second=30.6856Mi/s items_per_second=257.41M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100       105908 ns       105905 ns         6467 bytes_per_second=18.4422Mi/s items_per_second=154.705M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:100       467966 ns       467968 ns         1418 bytes_per_second=16.6945Mi/s items_per_second=140.044M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:1000         4134 ns         4134 ns       169739 bytes_per_second=29.5278Mi/s items_per_second=247.697M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:1000        15159 ns        15159 ns        46279 bytes_per_second=32.2106Mi/s items_per_second=270.202M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000      101632 ns       101632 ns         7019 bytes_per_second=19.2175Mi/s items_per_second=161.208M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:1000      454805 ns       454805 ns         1562 bytes_per_second=17.1777Mi/s items_per_second=144.097M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:5000         4355 ns         4355 ns       157860 bytes_per_second=28.0298Mi/s items_per_second=235.131M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:5000        15928 ns        15928 ns        44130 bytes_per_second=30.6553Mi/s items_per_second=257.155M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000      114474 ns       114474 ns         5980 bytes_per_second=17.0617Mi/s items_per_second=143.124M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:5000      529471 ns       529466 ns         1303 bytes_per_second=14.7554Mi/s items_per_second=123.778M/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:10000         403 ns          403 ns      1734553 bytes_per_second=302.976Mi/s items_per_second=2.54155G/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:10000         406 ns          406 ns      1703918 bytes_per_second=1.17413Gi/s items_per_second=10.0857G/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:10000        439 ns          439 ns      1572933 bytes_per_second=4.34285Gi/s items_per_second=37.3048G/s
BM_DecodeArrowBooleanRle/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:10000        660 ns          660 ns      1053931 bytes_per_second=11.5619Gi/s items_per_second=99.316G/s

PLAIN:

BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:1             709 ns          709 ns      1202751 bytes_per_second=172.176Mi/s items_per_second=1.44432G/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:1             602 ns          602 ns       924857 bytes_per_second=810.604Mi/s items_per_second=6.79984G/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1          25105 ns        25105 ns        22236 bytes_per_second=77.7987Mi/s items_per_second=652.623M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:1          99624 ns        99607 ns         7260 bytes_per_second=78.4332Mi/s items_per_second=657.945M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:100          3402 ns         3402 ns       208561 bytes_per_second=35.8773Mi/s items_per_second=300.96M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:100         10747 ns        10747 ns        70905 bytes_per_second=45.4327Mi/s items_per_second=381.117M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:100        65832 ns        65832 ns         8721 bytes_per_second=29.6684Mi/s items_per_second=248.877M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:100       309811 ns       309809 ns         2156 bytes_per_second=25.2172Mi/s items_per_second=211.537M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:1000         4216 ns         4216 ns       144072 bytes_per_second=28.9514Mi/s items_per_second=242.862M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:1000        19173 ns        19173 ns        29821 bytes_per_second=25.4678Mi/s items_per_second=213.639M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:1000      117705 ns       117706 ns         5677 bytes_per_second=16.5933Mi/s items_per_second=139.195M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:1000      527801 ns       527790 ns         1359 bytes_per_second=14.8023Mi/s items_per_second=124.171M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:5000         5134 ns         5134 ns       128720 bytes_per_second=23.7753Mi/s items_per_second=199.442M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:5000        18807 ns        18807 ns        36614 bytes_per_second=25.9622Mi/s items_per_second=217.787M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:5000      122685 ns       122684 ns         6010 bytes_per_second=15.92Mi/s items_per_second=133.547M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:5000      611359 ns       611363 ns         1003 bytes_per_second=12.7788Mi/s items_per_second=107.197M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:1024/null_in_ten_thousand:10000         784 ns          784 ns       779600 bytes_per_second=155.664Mi/s items_per_second=1.3058G/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:4096/null_in_ten_thousand:10000        1789 ns         1789 ns       390985 bytes_per_second=272.889Mi/s items_per_second=2.28916G/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:16384/null_in_ten_thousand:10000       5843 ns         5843 ns       125562 bytes_per_second=334.283Mi/s items_per_second=2.80417G/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull/num_values:65536/null_in_ten_thousand:10000      22560 ns        22561 ns        30684 bytes_per_second=346.29Mi/s items_per_second=2.90489G/s

For all-nulls seems that Rle is faster because it hacks the logic without any checking

@mapleFU mapleFU force-pushed the rle-boolean-impl-decode-arrow branch from 5e31baa to cb1a842 Compare April 4, 2024 09:00
cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
cpp/src/parquet/encoding.cc Outdated Show resolved Hide resolved
@@ -963,8 +963,6 @@ TYPED_TEST(EncodingAdHocTyped, ByteStreamSplitArrowDirectPut) {
}

TYPED_TEST(EncodingAdHocTyped, RleArrowDirectPut) {
// TODO: test with nulls once RleBooleanDecoder::DecodeArrow supports them
this->null_probability_ = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have coverage yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what does this mean? The case is covered here

@pitrou
Copy link
Member

pitrou commented Apr 4, 2024

@github-actions crossbow submit -g cpp

Copy link

github-actions bot commented Apr 4, 2024

Revision: cb1a842

Submitted crossbow builds: ursacomputing/crossbow @ actions-35133d7bec

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions

current_index_in_batch = 0;
};
if (null_count == 0) {
int sum_decode_count = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce a separate variable? We know that we're going to return the original value of num_values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed it, it's copy-pasted from origin impl

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please simplify the code a bit. There's a bunch of local variables which are mutually redundant.

@pitrou pitrou merged commit b99b00d into apache:main Apr 4, 2024
32 of 34 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Apr 4, 2024
@pitrou
Copy link
Member

pitrou commented Apr 4, 2024

This is a nice improvement, thank you @mapleFU !

@mapleFU mapleFU deleted the rle-boolean-impl-decode-arrow branch April 4, 2024 15:35
pitrou pushed a commit that referenced this pull request Apr 8, 2024
…oding (#41037)

### Rationale for this change

Previous patches ( #40876 , #40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: #41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
verma-kartik pushed a commit to verma-kartik/arrow that referenced this pull request Apr 11, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
… with nulls (apache#40995)

### Rationale for this change

Supports DecodeArrow with nulls in RleBooleanDecoder

### What changes are included in this PR?

Supports DecodeArrow with nulls in RleBooleanDecoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

currently not

* GitHub Issue: apache#40994

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
… with nulls (apache#40995)

### Rationale for this change

Supports DecodeArrow with nulls in RleBooleanDecoder

### What changes are included in this PR?

Supports DecodeArrow with nulls in RleBooleanDecoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

currently not

* GitHub Issue: apache#40994

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
… with nulls (apache#40995)

### Rationale for this change

Supports DecodeArrow with nulls in RleBooleanDecoder

### What changes are included in this PR?

Supports DecodeArrow with nulls in RleBooleanDecoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

currently not

* GitHub Issue: apache#40994

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
… with nulls (apache#40995)

### Rationale for this change

Supports DecodeArrow with nulls in RleBooleanDecoder

### What changes are included in this PR?

Supports DecodeArrow with nulls in RleBooleanDecoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

currently not

* GitHub Issue: apache#40994

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
… with nulls (apache#40995)

### Rationale for this change

Supports DecodeArrow with nulls in RleBooleanDecoder

### What changes are included in this PR?

Supports DecodeArrow with nulls in RleBooleanDecoder

### Are these changes tested?

Yes

### Are there any user-facing changes?

currently not

* GitHub Issue: apache#40994

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit b99b00d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants