[Java] Add validation functionality #37702

Closed
lidavidm opened this issue Sep 13, 2023 · 5 comments · Fixed by #37942

Comments

@lidavidm
Member

lidavidm commented Sep 13, 2023

Describe the enhancement requested

Other implementations can validate that the data in memory meets all the requirements of the Arrow specification (for example: offsets are nondecreasing for strings/lists; data is UTF-8 for strings). This would also be useful in Java (especially since vectors are mutable and it's a known shortcoming that mutators do not necessarily preserve type invariants).

In C++ this is Validate/ValidateFull.
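As an illustration of the two invariants mentioned above (nondecreasing offsets and UTF-8 string data), here is a minimal sketch in plain Java. It does not use the Arrow API; the class `StringColumnChecks` and its methods are hypothetical names, and the column is assumed to be represented as an `int[]` of offsets plus a `byte[]` of data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Arrow): checks a string column stored as offsets + data. */
final class StringColumnChecks {

  /** Offsets must be non-negative, never decrease, and stay within the data buffer. */
  static boolean offsetsValid(int[] offsets, int dataLength) {
    int previous = 0;
    for (int offset : offsets) {
      if (offset < previous || offset > dataLength) {
        return false;
      }
      previous = offset;
    }
    return true;
  }

  /**
   * Each value occupies data[offsets[i], offsets[i + 1]) and must decode as strict UTF-8.
   * Call offsetsValid first; this method assumes the offsets are in range.
   */
  static boolean valuesAreUtf8(int[] offsets, byte[] data) {
    for (int i = 0; i + 1 < offsets.length; i++) {
      CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT);
      try {
        decoder.decode(ByteBuffer.wrap(data, offsets[i], offsets[i + 1] - offsets[i]));
      } catch (CharacterCodingException e) {
        return false; // malformed or truncated byte sequence
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] data = "h\u00e9llo".getBytes(StandardCharsets.UTF_8); // 6 bytes, 'é' takes 2
    int[] good = {0, 3, 6}; // "hé", "llo"
    int[] split = {0, 2, 6}; // cuts 'é' in half
    System.out.println(offsetsValid(good, data.length) && valuesAreUtf8(good, data));
    System.out.println(offsetsValid(split, data.length) && valuesAreUtf8(split, data));
  }
}
```

The second case has well-formed offsets but still fails, because a value boundary falls in the middle of a multi-byte character. That is why the UTF-8 check has to look at the data and not just the offsets.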

Component(s)

Java

@jduo
Member

jduo commented Sep 26, 2023

take

@jduo
Member

jduo commented Sep 26, 2023

@lidavidm, I see that there are already ValueVectorUtility#validate() and ValueVectorUtility#validateFull().
Is the ask to add validate() instance methods to the Vector classes? Or is it to add validity checks that are implemented in C++ but not in Java (the UTF-8 check and offset check mentioned above, and others)?

@lidavidm
Member Author

Honestly, I never realized that existed. Thanks for finding those. I think we should do both: add instance methods, and then generally make sure that Java validates the same things C++ does.

@jduo
Member

jduo commented Sep 27, 2023

It seems the definitions of validate() and validateFull() differ between C++ and Java.

In C++, validate() is an O(k) operation where k is the number of descendants and validateFull() is an O(k*n) operation where n is the length of each entry (in a string or list vector).

In Java, by contrast, ValueVectorUtility#validate() is an O(1) operation that performs no checks requiring iteration over the vector's fields, while ValueVectorUtility#validateFull() checks offsets for string/binary/list vectors and drills down into container vectors. Note that Java's validateFull() currently does not check that Decimal data is valid or that string data is UTF-8.

It seems like Java's validateFull() is somewhat similar to C++'s validate(), and Java's validate() is more like validateShallow().
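To make the cost difference concrete, here is a rough sketch in plain Java of a cheap check versus a full check on a list-like column. It is not the Arrow implementation in either language; `TwoTierValidation`, `cheapCheck`, and `fullCheck` are hypothetical names, and the column is assumed to be described by its length, an `int[]` of offsets, and the length of its child data.

```java
/** Hypothetical sketch (not Arrow code): constant-time versus per-entry validation. */
final class TwoTierValidation {

  /** O(1): looks only at sizes and the first and last offsets. */
  static boolean cheapCheck(int length, int[] offsets, int childLength) {
    if (length < 0) {
      return false;
    }
    if (length == 0) {
      return true;
    }
    if (offsets.length < length + 1) {
      return false; // n entries need n + 1 offsets
    }
    int first = offsets[0];
    int last = offsets[length];
    return first >= 0 && first <= last && last <= childLength;
  }

  /** O(n): additionally walks every pair of adjacent offsets. */
  static boolean fullCheck(int length, int[] offsets, int childLength) {
    if (!cheapCheck(length, offsets, childLength)) {
      return false;
    }
    for (int i = 0; i < length; i++) {
      if (offsets[i] > offsets[i + 1]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int[] corrupted = {0, 5, 2, 6}; // endpoints look fine, the middle does not
    System.out.println(cheapCheck(3, corrupted, 6)); // true
    System.out.println(fullCheck(3, corrupted, 6)); // false
  }
}
```

The corrupted example passes the cheap check because only the endpoints are inspected. Catching it requires the per-entry pass, which is the trade-off between the two tiers discussed in this comment.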

@lidavidm
Member Author

If validateFull is already checking offsets, I think it's appropriate to check string data as well? That said, if the only difference is validating decimal/UTF-8 data, then perhaps we can punt on that. (Does it validate union type codes/dense union offsets?)

jduo added a commit to jduo/arrow that referenced this issue Sep 28, 2023
Make vector validation more consistent with Array::Validate() in C++:
* Add validate() and validateFull() instance methods to vectors.
* Validate that VarCharVector and LargeVarCharVector contents are
  valid UTF-8.
* Validate that DecimalVector and Decimal256Vector contents fit
  within the supplied precision and scale.
* Validate that NullVectors contain only nulls.
* Validate that FixedSizeBinaryVector values have the correct
  length.
lidavidm pushed a commit that referenced this issue Sep 29, 2023
### Rationale for this change
Make vector validation code more consistent with C++. Add missing checks and have the entry point
be the same so that the code is easier to read/write when working with both languages.

### What changes are included in this PR?
Make vector validation more consistent with Array::Validate() in C++:
* Add validate() and validateFull() instance methods to vectors.
* Validate that VarCharVector and LargeVarCharVector contents are valid UTF-8.
* Validate that DecimalVector and Decimal256Vector contents fit within the supplied precision and scale.
* Validate that NullVectors contain only nulls.
* Validate that FixedSizeBinaryVector values have the correct length.

### Are these changes tested?
Yes.

### Are there any user-facing changes?
No.
* Closes: #37702

Authored-by: James Duong <duong.james@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
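
For readers unfamiliar with the decimal check listed in the commit above, the following is a minimal sketch of what "fits within the supplied precision and scale" can mean, written against `java.math.BigDecimal`. It is an assumption-laden illustration, not the code from the PR: `DecimalChecks` and `fitsPrecisionAndScale` are hypothetical names, and the real vectors store unscaled integers instead of `BigDecimal` objects.

```java
import java.math.BigDecimal;

/** Hypothetical helper (not the PR's code): precision/scale check for one decimal value. */
final class DecimalChecks {

  /**
   * True when the value can be represented at exactly the given scale without
   * rounding, and its unscaled digits do not exceed the given precision.
   */
  static boolean fitsPrecisionAndScale(BigDecimal value, int precision, int scale) {
    BigDecimal rescaled;
    try {
      // setScale without a rounding mode throws if digits would be lost.
      rescaled = value.setScale(scale);
    } catch (ArithmeticException e) {
      return false;
    }
    return rescaled.precision() <= precision;
  }

  public static void main(String[] args) {
    System.out.println(fitsPrecisionAndScale(new BigDecimal("123.45"), 5, 2)); // true
    System.out.println(fitsPrecisionAndScale(new BigDecimal("123.45"), 4, 2)); // false
    System.out.println(fitsPrecisionAndScale(new BigDecimal("1.234"), 5, 2)); // false
  }
}
```

A full validation pass would apply a check of this kind to every non-null entry, which is why it belongs in validateFull() and not in the cheap validate().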
@lidavidm lidavidm added this to the 14.0.0 milestone Sep 29, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…che#37942)

dongjoon-hyun pushed a commit to apache/spark that referenced this issue Nov 4, 2023
### What changes were proposed in this pull request?
This PR upgrades Apache Arrow from 13.0.0 to 14.0.0.

### Why are the changes needed?
The Apache Arrow 14.0.0 release brings a number of enhancements and bug fixes.

In terms of bug fixes, the release addresses several critical issues that were causing failures in integration jobs with Spark ([GH-36332](apache/arrow#36332)) and problems with importing empty data arrays ([GH-37056](apache/arrow#37056)). It also optimizes the process of appending variable-length vectors ([GH-37829](apache/arrow#37829)) and includes C++ libraries for macOS AArch64 in the Java JARs ([GH-38076](apache/arrow#38076)).

The new features and improvements focus on enhancing the handling and manipulation of data. This includes the introduction of DefaultVectorComparators for large types ([GH-25659](apache/arrow#25659)), support for extended expressions in ScannerBuilder ([GH-34252](apache/arrow#34252)), and the exposure of the VectorAppender class ([GH-37246](apache/arrow#37246)).

The release also brings enhancements to the development and testing process, with the CI environment now using JDK 21 ([GH-36994](apache/arrow#36994)). In addition, the release brings Java vector validation in line with C++ ([GH-37702](apache/arrow#37702)).

Furthermore, the usability of VarChar writers and binary writers has been improved with the addition of extra input methods ([GH-37705](apache/arrow#37705)), and VarCharWriter now supports writing from `Text` and `String` ([GH-37706](apache/arrow#37706)). The release also adds typed getters for StructVector, making data easier to access ([GH-37863](apache/arrow#37863)).

The full release notes are available at:
- https://arrow.apache.org/release/14.0.0.html

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43650 from LuciferYang/arrow-14.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…che#37942)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…che#37942)
