Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for decimal batch reader #22636

Merged
merged 1 commit into from
Jul 12, 2024
Merged

Conversation

wypb
Copy link
Contributor

@wypb wypb commented Apr 30, 2024

Description

Add support for decimal batch reader

Benchmark(The lower the better)

Benchmark                                                          (byteArrayLength)  (decimalPrimitiveTypeName)  (enableOptimizedReader)  (nullable)  (writerVersion)  Mode  Cnt    Score    Error  Units
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                     true        true      PARQUET_1_0  avgt   60   35.978 ±  0.213  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                     true        true      PARQUET_2_0  avgt   60   56.711 ±  0.247  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                     true       false      PARQUET_1_0  avgt   60   63.970 ±  0.239  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                     true       false      PARQUET_2_0  avgt   60  104.495 ±  1.744  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                    false        true      PARQUET_1_0  avgt   60   72.606 ±  0.766  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                    false        true      PARQUET_2_0  avgt   60   64.700 ±  1.676  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                    false       false      PARQUET_1_0  avgt   60  119.463 ±  2.066  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A                      BINARY                    false       false      PARQUET_2_0  avgt   60   97.458 ±  5.932  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                     true        true      PARQUET_1_0  avgt   60   19.863 ±  0.125  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                     true        true      PARQUET_2_0  avgt   60   57.809 ±  0.486  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                     true       false      PARQUET_1_0  avgt   60   29.744 ±  0.314  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                     true       false      PARQUET_2_0  avgt   60  100.839 ±  0.838  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                    false        true      PARQUET_1_0  avgt   60   69.015 ±  0.418  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                    false        true      PARQUET_2_0  avgt   60   67.082 ±  2.169  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                    false       false      PARQUET_1_0  avgt   60  113.877 ±  3.237  ns/op
BenchmarkDecimalColumnBatchReader.readLongDecimal                                N/A    FIXED_LEN_BYTE_ARRAY(16)                    false       false      PARQUET_2_0  avgt   60  101.167 ±  4.742  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                     true        true      PARQUET_1_0  avgt   60    5.892 ±  0.081  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                     true        true      PARQUET_2_0  avgt   60    7.909 ±  0.077  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                     true       false      PARQUET_1_0  avgt   60    2.198 ±  0.016  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                     true       false      PARQUET_2_0  avgt   60    6.230 ±  0.190  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                    false        true      PARQUET_1_0  avgt   60   25.590 ±  2.622  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                    false        true      PARQUET_2_0  avgt   60   20.364 ±  0.143  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                    false       false      PARQUET_1_0  avgt   60   16.077 ±  0.222  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT32                    false       false      PARQUET_2_0  avgt   60   20.276 ±  3.783  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                     true        true      PARQUET_1_0  avgt   60   13.379 ±  4.040  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                     true        true      PARQUET_2_0  avgt   60    9.310 ±  0.036  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                     true       false      PARQUET_1_0  avgt   60    3.766 ±  0.035  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                     true       false      PARQUET_2_0  avgt   60    8.813 ±  0.083  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                    false        true      PARQUET_1_0  avgt   60   25.213 ±  0.222  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                    false        true      PARQUET_2_0  avgt   60   21.266 ±  0.068  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                    false       false      PARQUET_1_0  avgt   60   22.471 ±  0.617  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                       INT64                    false       false      PARQUET_2_0  avgt   60   18.181 ±  0.937  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                     true        true      PARQUET_1_0  avgt   60   13.513 ±  1.030  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                     true        true      PARQUET_2_0  avgt   60   31.413 ±  7.525  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                     true       false      PARQUET_1_0  avgt   60   16.384 ±  2.283  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                     true       false      PARQUET_2_0  avgt   60   37.136 ±  0.573  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                    false        true      PARQUET_1_0  avgt   60   38.115 ±  2.961  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                    false        true      PARQUET_2_0  avgt   60   33.686 ±  0.890  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                    false       false      PARQUET_1_0  avgt   60   38.408 ±  0.306  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A                      BINARY                    false       false      PARQUET_2_0  avgt   60   39.839 ±  0.300  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                     true        true      PARQUET_1_0  avgt   60    6.254 ±  0.055  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                     true        true      PARQUET_2_0  avgt   60   18.820 ±  0.159  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                     true       false      PARQUET_1_0  avgt   60    3.435 ±  0.110  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                     true       false      PARQUET_2_0  avgt   60   29.901 ±  0.606  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                    false        true      PARQUET_1_0  avgt   60   31.011 ±  0.310  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                    false        true      PARQUET_2_0  avgt   60   35.836 ±  3.396  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                    false       false      PARQUET_1_0  avgt   60   36.312 ±  0.709  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimal                               N/A     FIXED_LEN_BYTE_ARRAY(8)                    false       false      PARQUET_2_0  avgt   60   61.468 ± 15.297  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                     true        true      PARQUET_1_0  avgt   60    0.784 ±  0.046  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                     true        true      PARQUET_2_0  avgt   60    6.970 ±  0.188  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                     true       false      PARQUET_1_0  avgt   60    0.842 ±  0.048  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                     true       false      PARQUET_2_0  avgt   60   12.153 ±  5.229  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                    false        true      PARQUET_1_0  avgt   60   39.380 ±  4.667  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                    false        true      PARQUET_2_0  avgt   60   21.779 ±  1.257  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                    false       false      PARQUET_1_0  avgt   60   33.686 ±  0.484  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  1                         N/A                    false       false      PARQUET_2_0  avgt   60   18.937 ±  0.199  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                     true        true      PARQUET_1_0  avgt   60    1.316 ±  0.030  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                     true        true      PARQUET_2_0  avgt   60   27.600 ±  0.792  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                     true       false      PARQUET_1_0  avgt   60    1.307 ±  0.032  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                     true       false      PARQUET_2_0  avgt   60   27.122 ±  0.425  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                    false        true      PARQUET_1_0  avgt   60   33.514 ±  0.238  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                    false        true      PARQUET_2_0  avgt   60   43.301 ±  1.226  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                    false       false      PARQUET_1_0  avgt   60   33.281 ±  0.459  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  2                         N/A                    false       false      PARQUET_2_0  avgt   60   39.255 ±  0.485  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                     true        true      PARQUET_1_0  avgt   60    1.710 ±  0.022  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                     true        true      PARQUET_2_0  avgt   60   28.296 ±  0.558  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                     true       false      PARQUET_1_0  avgt   60    1.813 ±  0.094  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                     true       false      PARQUET_2_0  avgt   60   27.826 ±  0.314  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                    false        true      PARQUET_1_0  avgt   60   36.132 ±  1.231  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                    false        true      PARQUET_2_0  avgt   60   41.832 ±  1.558  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                    false       false      PARQUET_1_0  avgt   60   34.004 ±  0.396  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  3                         N/A                    false       false      PARQUET_2_0  avgt   60   38.750 ±  0.302  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                     true        true      PARQUET_1_0  avgt   60    2.203 ±  0.043  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                     true        true      PARQUET_2_0  avgt   60   28.282 ±  0.533  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                     true       false      PARQUET_1_0  avgt   60    2.149 ±  0.029  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                     true       false      PARQUET_2_0  avgt   60   28.413 ±  0.598  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                    false        true      PARQUET_1_0  avgt   60   34.270 ±  0.404  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                    false        true      PARQUET_2_0  avgt   60   39.215 ±  0.549  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                    false       false      PARQUET_1_0  avgt   60   36.183 ±  0.952  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  4                         N/A                    false       false      PARQUET_2_0  avgt   60   40.536 ±  0.995  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                     true        true      PARQUET_1_0  avgt   60    2.498 ±  0.101  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                     true        true      PARQUET_2_0  avgt   60   28.847 ±  0.456  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                     true       false      PARQUET_1_0  avgt   60    2.499 ±  0.083  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                     true       false      PARQUET_2_0  avgt   60   29.880 ±  0.273  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                    false        true      PARQUET_1_0  avgt   60   34.867 ±  0.772  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                    false        true      PARQUET_2_0  avgt   60   41.475 ±  1.487  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                    false       false      PARQUET_1_0  avgt   60   38.280 ±  4.063  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  5                         N/A                    false       false      PARQUET_2_0  avgt   60   39.719 ±  0.523  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                     true        true      PARQUET_1_0  avgt   60    2.642 ±  0.061  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                     true        true      PARQUET_2_0  avgt   60   29.860 ±  0.367  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                     true       false      PARQUET_1_0  avgt   60    2.663 ±  0.042  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                     true       false      PARQUET_2_0  avgt   60   29.849 ±  0.626  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                    false        true      PARQUET_1_0  avgt   60   35.169 ±  0.158  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                    false        true      PARQUET_2_0  avgt   60   40.821 ±  0.510  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                    false       false      PARQUET_1_0  avgt   60   38.378 ±  1.814  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  6                         N/A                    false       false      PARQUET_2_0  avgt   60   44.101 ±  4.470  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                     true        true      PARQUET_1_0  avgt   60    2.788 ±  0.036  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                     true        true      PARQUET_2_0  avgt   60   29.854 ±  0.341  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                     true       false      PARQUET_1_0  avgt   60    2.868 ±  0.052  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                     true       false      PARQUET_2_0  avgt   60   30.821 ±  0.544  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                    false        true      PARQUET_1_0  avgt   60   38.564 ±  1.608  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                    false        true      PARQUET_2_0  avgt   60   41.938 ±  0.737  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                    false       false      PARQUET_1_0  avgt   60   37.818 ±  2.376  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  7                         N/A                    false       false      PARQUET_2_0  avgt   60   86.133 ± 29.822  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                     true        true      PARQUET_1_0  avgt   60    4.066 ±  0.423  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                     true        true      PARQUET_2_0  avgt   60   34.223 ±  3.815  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                     true       false      PARQUET_1_0  avgt   60    4.593 ±  0.343  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                     true       false      PARQUET_2_0  avgt   60   30.767 ±  1.049  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                    false        true      PARQUET_1_0  avgt   60   50.825 ±  7.990  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                    false        true      PARQUET_2_0  avgt   60   55.074 ±  6.835  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                    false       false      PARQUET_1_0  avgt   60   43.135 ±  9.506  ns/op
BenchmarkDecimalColumnBatchReader.readShortDecimalByteArrayLength                  8                         N/A                    false       false      PARQUET_2_0  avgt   60   48.081 ±  5.818  ns/op

Impact

When we enable Parquet batch reader(parquet_batch_read_optimization_enabled=true), the Decimal type will read data in Batch mode.

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Hive Connector Changes
* Add support for decimal batch reader :pr:`22636`

CC: @zhenxiao

@wypb wypb requested review from shangxinli and a team as code owners April 30, 2024 04:02
@wypb wypb requested a review from presto-oss April 30, 2024 04:02
Copy link

linux-foundation-easycla bot commented Apr 30, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: wypb / name: wypb (cabc10c)

@wypb wypb force-pushed the decimal_vector branch 6 times, most recently from b0972c0 to 96f2271 Compare April 30, 2024 13:26
@tdcmeehan tdcmeehan self-assigned this Apr 30, 2024
@tdcmeehan tdcmeehan requested a review from zhenxiao April 30, 2024 14:12
Copy link
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work, @wypb
looks nice. a few comments about comments, and code structure

switch (length) {
case 8:
value |= bytes[startOffset + 7] & 0xFFL;
// fall through
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line comment seems not useful
same for following lines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}

DecimalLogicalTypeAnnotation decimalLogicalTypeAnnotation = (DecimalLogicalTypeAnnotation) logicalTypeAnnotation;
return decimalLogicalTypeAnnotation.getPrecision() <= Decimals.MAX_SHORT_PRECISION;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import Decimals.MAX_SHORT_PRECISION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

public class ShortDecimalFixedWidthByteArrayBatchDecoder
{
private static final ShortDecimalDecoder[] VALUE_DECODERS = new ShortDecimalDecoder[] {
new BigEndianReader1(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need 7 readers? add a comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually further optimizes the reading speed of short decimals. The implementation of ShortDecimalFixedWidthByteArrayBatchDecoder actually refers to the implementation of Trino: trinodb/trino@f71a815

import static com.facebook.presto.parquet.ParquetTypeUtils.getShortDecimalValue;
import static com.google.common.base.Preconditions.checkArgument;

public class BinaryShortDecimalDeltaValuesDecoder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we have an abstract class, where BinaryLongDecimalDeltaValuesDecoder and BinaryShortDecimalDeltaValuesDecoder are its subclasses?
The code of BinaryLongDecimalDeltaValuesDecoder and BinaryShortDecimalDeltaValuesDecoder has lots of same code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see how to refactor these two classes because they implement different interfaces.

import static com.google.common.base.Preconditions.checkArgument;
import static java.util.Objects.requireNonNull;

public class Int64ShortDecimalDeltaValuesDecoder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we have an abstract class, and Int32ShortDecimalDeltaValuesDecoder, Int64ShortDecimalDeltaValuesDecoder becomes its sub-classes? The classes share many same code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The common parts of Int32ShortDecimalDeltaValuesDecoder and Int64ShortDecimalDeltaValuesDecoder have been extracted into AbstractInt64AndInt32ShortDecimalDeltaValuesDecoder.

import static io.airlift.slice.SizeOf.SIZE_OF_LONG;
import static java.util.Objects.requireNonNull;

public class LongDecimalApacheParquetValueDecoder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bad naming. LongDecimalApacheParquetValueDecoder why Apache and Parquet appears in the name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have renamed LongDecimalApacheParquetValueDecoder and ShortDecimalApacheParquetValueDecoder to FixedLenByteArrayShortDecimalDeltaValueDecoder and FixedLenByteArrayLongDecimalDeltaValueDecoder respectively.

import static com.google.common.base.Preconditions.checkArgument;
import static java.util.Objects.requireNonNull;

public class ShortDecimalApacheParquetValueDecoder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bad naming

Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wypb Thank you so much for this great work. I wonder if you could add tests?

return decimalLogicalTypeAnnotation.getPrecision() <= Decimals.MAX_SHORT_PRECISION;
}

public static boolean isLongDecimalType(ColumnDescriptor descriptor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already removed.

int inputBytesOffset = input.getByteArrayOffset();
for (int i = offset; i < offset + length; i++) {
checkBytesFitInShortDecimal(inputBytes, inputBytesOffset, extraBytesLength, columnDescriptor);
values[i] = getShortDecimalValue(inputBytes, inputBytesOffset + extraBytesLength, Long.BYTES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if extraBytesLength < 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the implementation of the code, extraBytesLength will not be less than 0, but must be greater than 0.

if (typeLength <= Long.BYTES) {
    ....        
}
int extraBytesLength = typeLength - Long.BYTES;

// Equivalent to expectedValue = bytes[endOffset] < 0 ? -1 : 0
byte expectedValue = (byte) (bytes[endOffset] >> 7);
for (int i = offset; i < endOffset; i++) {
if (bytes[i] != expectedValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain how this works? Suppose inputBytesOffset = 0 and typeLength=9 here, then your extraBytesLength = 1, and checkBytesFitInShortDecimal(inputBytes, 0, 1, descriptor) is called. And your expectedValue is to check if inputBytes[1] < 0 ? -1 : 0. Since the values are encoded big-endian byte order, I assume you wanted to check the the most significant byte which should be inputBytes[0], but you're checking the second most significant byte inputBytes[1]. Does that work?

Also The largest precision for short decimal is 18 and the value is 999,999,999,999,999,999. It can be expressed with 60 bits value 0xDE0B6B3A763FFFF. If you really need to verify there is no overflow, the bits 61-64 also need to be checked. I don't see it's done here. Could you explain a little bit your idea?

Copy link
Contributor Author

@wypb wypb May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bytes[endOffset] of the above code is actually the most significant byte. To illustrate this better, I built a test locally. The original data is 123456789012.12345678, typeLength=14. the contents of bytes are [2, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0, 0, -85, 84, -87, -116, -23, -53, -11, 78]

NOTE: -85, 84, -87, -116, -23, -53, -11, 78 converted to binary is actually the high 56-bit data of 12345678901212345678 converted to binary.

-85, 84, -87, -116, -23, -53, -11, 78 binary representation:
10101011010101001010100110001100111010011100101111110101
12345678901212345678 binary representation:
1010101101010100101010011000110011101001110010111111010101001110

In this scenario, inputBytesOffset = 6, extraBytesLength = 6, so endOffset = 12, bytes[endOffset] = -85, which is actually the most significant byte. Then we check whether bytes[6.. 11] is -1. Attached is the file I tested.
86216465-ad5d-4acc-bc8b-0d1972149d8c.tgz

@wypb wypb force-pushed the decimal_vector branch 2 times, most recently from 6a1eda5 to 49c8f19 Compare May 9, 2024 11:00
Copy link

github-actions bot commented May 9, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff 9898886...cabc10c.

No notifications.

Copy link
Contributor

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems short on new tests. I would have expected quite a number given all the new code.


for (int i = 0; i < bytes.length; i++) {
value |= ((long) bytes[bytes.length - i - 1] & 0xFFL) << (8 * i);
public static long getShortDecimalValue(byte[] bytes, int startOffset, int length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be private or not public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this method is used in many places.

return data;
}

private static int propagateSignBit(int value, int bitsToPad)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why here and in ByteUtils?

Copy link
Contributor Author

@wypb wypb May 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the code and found that this function can actually be deleted.

@@ -61,4 +61,9 @@ public static void unpack8Values(byte inByte, byte[] out, int outPos)
out[6 + outPos] = (byte) (inByte >> 6 & 1);
out[7 + outPos] = (byte) (inByte >> 7 & 1);
}

public static long propagateSignBit(long value, int bitsToPad)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method could use a unit test that addresses it directly

public SimpleSliceInputStream(Slice slice, int offset)
{
this.slice = requireNonNull(slice, "slice is null");
checkArgument(slice.length() == 0 || slice.hasByteArray(), "SimpleSliceInputStream supports only slices backed by byte array");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only supports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@wypb
Copy link
Contributor Author

wypb commented May 9, 2024

@zhenxiao @yingsu00 @elharo sorry for the late reply. I have made modifications according to the previous review comments.

I wonder if you could add tests?

I generated Parquet files with different encodings locally. I will see how to add them to the test set.
In addition, BenchmarkShortDecimalColumnReader and BenchmarkLongDecimalColumnReader have different coding data verification logic in batch mode and non-batch mode.

@steveburnett
Copy link
Contributor

Nit, suggest release note entry change:

== RELEASE NOTES ==

Hive Connector Changes
* Add support decimal batch reader :pr:`22636`

@wypb wypb changed the title Add support decimal batch reader Add support for decimal batch reader May 10, 2024
@wypb wypb force-pushed the decimal_vector branch 2 times, most recently from 153aa02 to c182a88 Compare May 10, 2024 03:04
@wypb
Copy link
Contributor Author

wypb commented May 10, 2024

@zhenxiao @yingsu00 I added three new tests in AbstractTestParquetReader.java: testLongDecimalBackedByBinary, testShortDecimalBackedByBinary and testShortDecimalBackedByFixedLenByteArray, plus the previous testDecimalBackedByFixedLenByteArray (already named to testLongDecimalBackedByFixedLenByteArray), testDecimalBackedByINT64 and DecimalBackedByINT32 test has been able to cover all the Decoder I added this time.

@wypb wypb force-pushed the decimal_vector branch 2 times, most recently from 7fa98f4 to fd3a640 Compare May 10, 2024 03:42
@wypb
Copy link
Contributor Author

wypb commented May 22, 2024

@zhenxiao @yingsu00 Any comments on this?

zhenxiao
zhenxiao previously approved these changes May 22, 2024
Copy link
Collaborator

@zhenxiao zhenxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

Copy link
Contributor

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of new public API here that feels like it needs unit tests, e.g. SimpleSliceInputStream

@tdcmeehan
Copy link
Contributor

@wypb could you add more tests to address @elharo's feedback?

@wypb
Copy link
Contributor Author

wypb commented Jun 6, 2024

HI @tdcmeehan @elharo Thank you for your review, I'll add some test cases soon.

Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wypb Will you please add the tests @elharo requested? For example, you can reference TestBooleanStream for writing tests for SimpleSliceInputStream. Thank you!

@@ -173,11 +234,46 @@ private static final ValuesDecoder createValuesDecoder(ColumnDescriptor columnDe

if ((encoding == DELTA_BYTE_ARRAY || encoding == DELTA_LENGTH_BYTE_ARRAY) && type == PrimitiveTypeName.BINARY) {
ByteBufferInputStream inputStream = ByteBufferInputStream.wrap(ByteBuffer.wrap(buffer, offset, length));

Optional<Type> prestoType = createDecimalType(columnDescriptor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is from existing code, but it is confusing to detect if the logical type is decimal by calling createDecimalType() and check its returned type. It'll be clearer to make the type detecting consistent with other encodings, e.g. as what you did on line 195, 196

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already refactored, thanks.


public FixedLenByteArrayShortDecimalPlainValuesDecoder(ColumnDescriptor columnDescriptor, byte[] byteBuffer, int bufferOffset, int length)
{
this.columnDescriptor = columnDescriptor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the existing decoders code has the same pattern, but can you please add requireNonNull() on columnDescriptor and byteBuffer? Same for all other constructors for the classes you add. Thanks!

@wypb
Copy link
Contributor Author

wypb commented Jul 9, 2024

HI @elharo @tdcmeehan @yingsu00 thanks for your inputs and sorry for the delay.

I modified some code according to the previous review and added a new test case com.facebook.presto.parquet.batchreader.TestSimpleSliceInputStream. I have rebased the code and the diff file can be found in d527a91.

throws Exception
{
for (int precision = 1; precision <= MAX_SHORT_PRECISION; precision++) {
int scale = ThreadLocalRandom.current().nextInt(precision);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get very nervous seeing random in tests. If something fails it's not reproducible, Use fixed constant test data instead, even if the fixed values are initially randomly chosen.

throws Exception
{
for (int precision = 1; precision <= MAX_SHORT_PRECISION; precision++) {
int scale = ThreadLocalRandom.current().nextInt(precision);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

if (!isNested && isShortDecimalType(descriptor)) {
int precision = ((DecimalLogicalTypeAnnotation) descriptor.getPrimitiveType().getLogicalTypeAnnotation()).getPrecision();
if (precision < 10) {
log.warn("PrimitiveTypeName is INT64 but precision is less then 10.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete the warning; it's not actionable

or if I'm wrong and this is a real problem, then a waring isn't enough. This should fail outright

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec says we need to produce a warning when precision < 10 for INT64. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

* is not a common one, just use the existing one provided by Parquet library and add a wrapper around it that satisfies the
* {@link ShortDecimalValuesDecoder} interface.
*/
public class FixedLenByteArrayShortDecimalDeltaValueDecoder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid abbreviations; thus FixedLen --> FixedLength

but aren't all arrays in java fixed length? So maybe just ByteArrayShortDecimalDeltaValueDecoder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming convention of the Decoder class name here is ParquetPrimitiveType + [Short|Long]Decimal + encoding + ValuesDecoder. The PrimitiveTypeName corresponding to this class is FIXED_LEN_BYTE_ARRAY, so this name is used.

int positionOffset = offsets[i];
int positionLength = offsets[i + 1] - positionOffset;
byte[] temp = new byte[positionLength];
System.arraycopy(byteBuffer, positionOffset, temp, 0, positionLength);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I find Arrays.copyOf a little easier to read; up to you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to byte[] temp = Arrays.copyOfRange(byteBuffer, positionOffset, positionOffset + positionLength);

@Override
public void readNext(long[] values, int offset, int length)
{
final byte[] localByteBuffer = byteBuffer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Danger! Since this local variable is just a reference, byteBuffer is modified anyway when localByteBuffer is. Code below might be correct, I'm not sure, but this variable should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

byte expectedValue = (byte) (bytes[endOffset] >> 7);
for (int i = offset; i < endOffset; i++) {
if (bytes[i] != expectedValue) {
throw new PrestoException(NOT_SUPPORTED, format(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string concatenation is simpler here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


// We first shift the byte as left as possible. Then, when shifting back right,
// the sign bit will get propagated
values[offset] = value << 56 >> 56;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Order of operations is foggy. Please use parentheses to make this explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

long[] result = new long[size];
for (int i = 0; i < size; i++) {
result[i] = Math.max(
Math.min(randomLong(random, bitWidth), max),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid random numbers in tests. Test results need to be reproducible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@wypb
Copy link
Contributor Author

wypb commented Jul 10, 2024

Thank you @elharo for the review! I have updated with a new (squashed) commit, and I have addressed all your comments. Let me know if I missed anything.

@yingsu00
Copy link
Contributor

@wypb There is a test failure: Failures:
[ERROR] TestPrestoNativeIcebergTpcdsQueriesParquetUsingThrift.doDeletesAndQuery:399->verifyDeletes:390->AbstractTestQueryFramework.computeScalar:149->AbstractTestQueryFramework.computeActual:139 ? Runtime size <= capacity_ (18446744073709551176 vs. 1440) Split [Hive: file:/tmp/iceberg_data/HIVE/tpcds/store_sales/data/183baea9-babf-4daa-875b-2148de383c5a.parquet 0 - 2071914] Task 20240710_023723_00114_2m4dj.1.0.2.0 Operator: TableScan[0] 0

Can you please investigate? THanks!

@tdcmeehan
Copy link
Contributor

@yingsu00 this is actually likely due to facebookincubator/velox#10261 and is being backed out in facebookincubator/velox#10431. It was added in #23138 (which appeared to have been merged in spite of the failure it introduced). See: #23156

@wypb
Copy link
Contributor Author

wypb commented Jul 11, 2024

Hi @tdcmeehan thank you for sharing the information. @yingsu00 I synchronized the latest code, and now CI is all green.

@yingsu00
Copy link
Contributor

@yingsu00 this is actually likely due to facebookincubator/velox#10261 and is being backed out in facebookincubator/velox#10431. It was added in #23138 (which appeared to have been merged in spite of the failure it introduced). See: #23156

Thank you @tdcmeehan for letting me know!

@yingsu00 yingsu00 merged commit d76c2d3 into prestodb:master Jul 12, 2024
56 checks passed
@wypb wypb deleted the decimal_vector branch July 12, 2024 22:55
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
34 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants