Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable casting between Utf8/LargeUtf8 and Binary/LargeBinary #3542

Merged
merged 6 commits into from
Jan 25, 2023

Conversation

viirya
Copy link
Member

@viirya viirya commented Jan 17, 2023

Which issue does this PR close?

Closes #1179.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 17, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is sound as it doesn't perform UTF-8 validation on the strings AFAICT.

I think we should implement StringArray <-> BinaryArray and LargeStringArray <-> LargeBinaryArray using the existing From implementations, that among other things avoid re-encoding offsets and perform optimised validation.

We can then implement Binary <-> LargeBinary and StringArray <-> LargeStringArray separately, ideally sharing logic with ListArray <-> LargeListArray

@viirya viirya marked this pull request as draft January 17, 2023 20:33
Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold Sorry for late.

For the question about UTF-8 validation, this change doesn't change casting from binary/large binary to string/large string. I think this is the only scenarios we need to perform UTF-8 validation.

This change only adds:

  1. LargeString to Boolean, Boolean to LargeString
  2. Binary to LargeBinary, LargeBinary to Binary
  3. String to LargeBinary
  4. LargeString to Binary

For above cases, I think we don't need UTF-8 validation.

@viirya viirya marked this pull request as ready for review January 22, 2023 23:41
Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we support Binary to String/LargeString. This patch doesn't touch them. I will extend it to support LargeBinary to String/LargeString in other work.

arrow-cast/src/cast.rs Outdated Show resolved Hide resolved
arrow-cast/src/cast.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can go in as is, I hadn't noticed that cast_byte_container was only being used in "valid" contexts. However, I think we should either make it unsafe, or restrict its signature to be sound prior to merge.

@viirya viirya merged commit bf21ad9 into apache:master Jan 25, 2023
@ursabot
Copy link

ursabot commented Jan 25, 2023

Benchmark runs are scheduled for baseline = d938cd9 and contender = bf21ad9. bf21ad9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support casting to/from DictionaryArray <--> LargeUTF8, Binary and LargeBinary in Cast Kernels
3 participants