Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40872: [C++][Parquet] Encoding: Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder #40876

Merged
merged 13 commits into from
Apr 3, 2024

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Mar 28, 2024

Rationale for this change

This is for enhance boolean decoding. I optimized the DecodeArrow for PlainBoolean

What changes are included in this PR?

Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder, and add benchmarks

Are these changes tested?

Yes

Are there any user-facing changes?

Minor optimization. And Decode boolean will change the syntax

Copy link

⚠️ GitHub issue #40872 has been automatically assigned in GitHub to PR creator.

@mapleFU mapleFU force-pushed the plain-boolean-decoding-arrow branch from 962dcfa to b03728b Compare March 28, 2024 16:22
@mapleFU mapleFU changed the title GH-40872: [C++][Parquet] Encoding: Optimize DecodeArrow for PlainBooleanDecoder GH-40872: [C++][Parquet] Encoding: Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder Mar 28, 2024
} else {
int64_t previous_offset = 0;
int64_t previous_value_offset = 0;
PARQUET_THROW_NOT_OK(::arrow::internal::VisitSetBitRuns(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the data is interleaving "null, non-null, null, non-null, ...". Don't know that would performance be

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the best performance might getting a BitRun scanner here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can start with this. Are there benchmarks for this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pasted this in the pr, PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmarks in this PR always have null_probability = 0, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Aha previously all null_probability is 0...

cpp/src/parquet/encoding.cc Show resolved Hide resolved
@github-actions github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Mar 29, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Mar 29, 2024

Before:

BM_PlainDecodingBoolean/1024                 249 ns          249 ns      2809214 bytes_per_second=3.82556G/s
BM_PlainDecodingBoolean/4096                 600 ns          600 ns      1164784 bytes_per_second=6.36244G/s
BM_PlainDecodingBoolean/32768               3858 ns         3853 ns       181202 bytes_per_second=7.91948G/s
BM_PlainDecodingBoolean/65536               7524 ns         7517 ns        93990 bytes_per_second=8.11909G/s
BM_PlainDecodingBooleanToBitmap/1024        7101 ns         7095 ns        97939 bytes_per_second=137.642M/s
BM_PlainDecodingBooleanToBitmap/4096       27520 ns        27510 ns        25488 bytes_per_second=141.994M/s
BM_PlainDecodingBooleanToBitmap/32768     219035 ns       218869 ns         3197 bytes_per_second=142.779M/s
BM_PlainDecodingBooleanToBitmap/65536     439026 ns       438836 ns         1598 bytes_per_second=142.422M/s

After:

BM_PlainDecodingBoolean/1024                 248 ns          248 ns      2820193 bytes_per_second=3.84033G/s
BM_PlainDecodingBoolean/4096                 610 ns          604 ns      1168361 bytes_per_second=6.31453G/s
BM_PlainDecodingBoolean/32768               3817 ns         3815 ns       181765 bytes_per_second=8.00006G/s
BM_PlainDecodingBoolean/65536               7483 ns         7478 ns        93656 bytes_per_second=8.16155G/s
BM_PlainDecodingBooleanToBitmap/1024         133 ns          133 ns      5271521 bytes_per_second=7.1644G/s
BM_PlainDecodingBooleanToBitmap/4096         139 ns          139 ns      5047009 bytes_per_second=27.3761G/s
BM_PlainDecodingBooleanToBitmap/32768        192 ns          192 ns      3691555 bytes_per_second=159.17G/s
BM_PlainDecodingBooleanToBitmap/65536        253 ns          253 ns      2845413 bytes_per_second=241.105G/s

@@ -1214,18 +1234,15 @@ inline int PlainBooleanDecoder::DecodeArrow(

int PlainBooleanDecoder::Decode(uint8_t* buffer, int max_values) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When refering to #38355 . Previously we won't crack the trailing bit, now we'll . Should we keep the same syntax? @pitrou @wgtmac

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's ok.

@mapleFU
Copy link
Member Author

mapleFU commented Apr 3, 2024

After:

BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:1024/null-prob:10        3631 ns         3612 ns       195883 bytes_per_second=33.7927M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:2048/null-prob:10        6854 ns         6724 ns       104659 bytes_per_second=36.3087M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:4096/null-prob:10       13050 ns        12957 ns        54526 bytes_per_second=37.6833M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:8192/null-prob:10       26314 ns        25650 ns        26516 bytes_per_second=38.0727M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:16384/null-prob:10      51909 ns        51337 ns        13720 bytes_per_second=38.0451M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:32768/null-prob:10     110306 ns       107744 ns         6624 bytes_per_second=36.2549M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:65536/null-prob:10     221000 ns       220405 ns         3107 bytes_per_second=35.4461M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:1024/null-prob:50        5016 ns         4974 ns       141186 bytes_per_second=24.5432M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:2048/null-prob:50        9773 ns         9529 ns        72019 bytes_per_second=25.6217M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:4096/null-prob:50       21108 ns        18906 ns        37395 bytes_per_second=25.8263M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:8192/null-prob:50       39810 ns        38057 ns        18095 bytes_per_second=25.6605M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:16384/null-prob:50     101819 ns        91886 ns         7236 bytes_per_second=21.256M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:32768/null-prob:50     209925 ns       201723 ns         3433 bytes_per_second=19.3644M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:65536/null-prob:50     440182 ns       431220 ns         1641 bytes_per_second=18.1172M/s

Before:

BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:1024/null-prob:10        8560 ns         8509 ns        81931 bytes_per_second=14.3458M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:2048/null-prob:10       16993 ns        16875 ns        41546 bytes_per_second=14.4676M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:4096/null-prob:10       33400 ns        33270 ns        20855 bytes_per_second=14.6761M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:8192/null-prob:10       68693 ns        67399 ns        10604 bytes_per_second=14.4893M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:16384/null-prob:10     136645 ns       134282 ns         5094 bytes_per_second=14.545M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:32768/null-prob:10     267212 ns       264922 ns         2656 bytes_per_second=14.7449M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:65536/null-prob:10     531085 ns       528560 ns         1307 bytes_per_second=14.7807M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:1024/null-prob:50        6681 ns         6655 ns       104864 bytes_per_second=18.3438M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:2048/null-prob:50       13626 ns        13570 ns        51543 bytes_per_second=17.9908M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:4096/null-prob:50       30337 ns        30249 ns        22973 bytes_per_second=16.1423M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:8192/null-prob:50       69539 ns        68845 ns        10315 bytes_per_second=14.1848M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:16384/null-prob:50     147163 ns       146116 ns         4624 bytes_per_second=13.3669M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:32768/null-prob:50     299173 ns       298553 ns         2320 bytes_per_second=13.084M/s
BM_DecodeArrowBooleanPlain/DecodeArrowWithNull_Dense/num-values:65536/null-prob:50     615457 ns       613450 ns         1141 bytes_per_second=12.7353M/s

@mapleFU mapleFU requested a review from pitrou April 3, 2024 11:11
cpp/src/parquet/encoding.cc Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Outdated Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Outdated Show resolved Hide resolved
}
};

class BM_DecodeArrowBooleanRle : public BenchmarkDecodeArrowBoolean {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need this, you just need some member variable with Encoding::PLAIN or Encoding::RLE, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done yet, PTAL again

cpp/src/parquet/encoding_benchmark.cc Outdated Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Outdated Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Outdated Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Show resolved Hide resolved
cpp/src/parquet/encoding_benchmark.cc Outdated Show resolved Hide resolved
@mapleFU mapleFU requested a review from pitrou April 3, 2024 16:34
@mapleFU mapleFU merged commit a4acb64 into apache:main Apr 3, 2024
31 of 33 checks passed
@mapleFU mapleFU removed the awaiting committer review Awaiting committer review label Apr 3, 2024
@mapleFU mapleFU deleted the plain-boolean-decoding-arrow branch April 4, 2024 04:02
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a4acb64.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request Apr 8, 2024
…oding (#41037)

### Rationale for this change

Previous patches ( #40876 , #40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: #41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
verma-kartik pushed a commit to verma-kartik/arrow that referenced this pull request Apr 11, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…bitmap) for PlainBooleanDecoder (apache#40876)

### Rationale for this change

This is for enhance boolean decoding. I optimized the `DecodeArrow` for PlainBoolean

### What changes are included in this PR?

Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder, and add benchmarks

### Are these changes tested?

Yes

### Are there any user-facing changes?

Minor optimization. And `Decode` boolean will change the syntax

* GitHub Issue: apache#40872

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: mwish <maplewish117@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…bitmap) for PlainBooleanDecoder (apache#40876)

### Rationale for this change

This is for enhance boolean decoding. I optimized the `DecodeArrow` for PlainBoolean

### What changes are included in this PR?

Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder, and add benchmarks

### Are these changes tested?

Yes

### Are there any user-facing changes?

Minor optimization. And `Decode` boolean will change the syntax

* GitHub Issue: apache#40872

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: mwish <maplewish117@gmail.com>
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…bitmap) for PlainBooleanDecoder (apache#40876)

### Rationale for this change

This is for enhance boolean decoding. I optimized the `DecodeArrow` for PlainBoolean

### What changes are included in this PR?

Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder, and add benchmarks

### Are these changes tested?

Yes

### Are there any user-facing changes?

Minor optimization. And `Decode` boolean will change the syntax

* GitHub Issue: apache#40872

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: mwish <maplewish117@gmail.com>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…bitmap) for PlainBooleanDecoder (apache#40876)

### Rationale for this change

This is for enhance boolean decoding. I optimized the `DecodeArrow` for PlainBoolean

### What changes are included in this PR?

Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder, and add benchmarks

### Are these changes tested?

Yes

### Are there any user-facing changes?

Minor optimization. And `Decode` boolean will change the syntax

* GitHub Issue: apache#40872

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: mwish <maplewish117@gmail.com>
rok pushed a commit to tmct/arrow that referenced this pull request May 8, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…bitmap) for PlainBooleanDecoder (apache#40876)

### Rationale for this change

This is for enhance boolean decoding. I optimized the `DecodeArrow` for PlainBoolean

### What changes are included in this PR?

Optimize DecodeArrow/Decode(bitmap) for PlainBooleanDecoder, and add benchmarks

### Are these changes tested?

Yes

### Are there any user-facing changes?

Minor optimization. And `Decode` boolean will change the syntax

* GitHub Issue: apache#40872

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: mwish <maplewish117@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…ow decoding (apache#41037)

### Rationale for this change

Previous patches ( apache#40876 , apache#40995 ) introduce some bugs in boolean decoding. This patch add tests for DecodeArrow and fix the bug

### What changes are included in this PR?

1. Add tests for DecodeArrow with nulls and without nulls
2. Fix 2 bugs

### Are these changes tested?

Yes

### Are there any user-facing changes?

no

* GitHub Issue: apache#41032

Lead-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: mwish <anmmscs_maple@qq.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants